This is an R Markdown Notebook for analysis using data on the DC Bus System (WMATA Metrobus). The data were obtained here:
https://planitmetro.com/2016/11/16/data-download-metrobus-vehicle-location-data/
Control + Alt + Shift + m = rename in scope
Load the packages to be used.
package ‘tidyr’ was built under R version 3.2.5package ‘plyr’ was built under R version 3.2.5package ‘stringr’ was built under R version 3.2.5package ‘geosphere’ was built under R version 3.2.5
Get the Bus data.
First let’s check the working directory.
getwd()
[1] "/Users/mdturse/Desktop/Analytics/DCMetroBus"
Then, actually get the data.
The working directory was changed to /Users/mdturse/Desktop/Analytics/DCMetroBus/Bus AVL Oct 2016 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.Oct03Raw
'data.frame': 620274 obs. of 17 variables:
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : Factor w/ 266 levels "10A","10B","10E",..: 224 224 224 224 224 224 224 224 224 224 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 6 6 ...
$ Route_Direction : Factor w/ 11 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 7 6 3 2 8 1 1 2 3 ...
$ Stop_ID : Factor w/ 10552 levels "","1000001","1000003",..: 9682 9682 9683 9641 9640 8136 9668 9668 9796 9795 ...
$ Stop_Desc : Factor w/ 7740 levels "10TH ST + MICHIGAN AVE",..: 1346 1346 7417 7418 1346 2940 2939 2939 6926 6929 ...
$ Event_Type : int 4 5 4 4 4 3 3 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 2 3 3 3 1 1 3 3 3 ...
$ Event_Time : Factor w/ 75354 levels "10-3-16 1:00:00 AM",..: 47380 47506 47740 47814 47864 48244 48302 48540 49086 49190 ...
$ Departure_Time : Factor w/ 75396 levels "10-3-16 1:00:00 AM",..: 47406 47554 47766 47840 47890 48270 48536 48566 49112 49216 ...
$ Dwell_Time : int 0 11 0 0 0 0 104 0 0 0 ...
$ Delta_Time : int -177 -27 24 165 25 73 719 0 74 76 ...
$ Odometer_Distance: int 43543 43543 45139 46418 50115 51074 51303 53836 55633 56163 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 253 97 276 15 119 100 89 274 104 ...
Oct04Raw
'data.frame': 623427 obs. of 17 variables:
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : Factor w/ 266 levels "10A","10B","10E",..: 225 225 225 225 225 225 225 225 225 225 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 6 6 6 6 6 6 6 ...
$ Route_Direction : Factor w/ 9 levels "ANTICLKW","CLOCKWIS",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Stop_Sequence : int 1 1 8 1 1 1 1 3 2 5 ...
$ Stop_ID : Factor w/ 10555 levels "","1000001","1000003",..: 9671 9671 8138 8138 8138 8138 8138 9798 9799 9638 ...
$ Stop_Desc : Factor w/ 7717 levels "10TH ST + MICHIGAN AVE",..: 2939 2939 2940 2940 2940 2940 2940 6906 6903 4205 ...
$ Event_Type : int 3 4 3 3 3 5 5 4 5 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 1 3 1 1 1 2 2 3 2 3 ...
$ Event_Time : Factor w/ 77713 levels "10-4-16 1:00:00 AM",..: 49126 49240 50858 50908 50976 51116 51172 51714 51842 51940 ...
$ Departure_Time : Factor w/ 77739 levels "10-4-16 1:00:00 AM",..: 49209 49251 50869 50957 50987 51165 51185 51725 51933 51951 ...
$ Dwell_Time : int 79 0 59 19 59 19 1 0 40 0 ...
$ Delta_Time : int 35 36 36 246 244 255 264 159 129 139 ...
$ Odometer_Distance: int 56958 60750 69747 69971 69747 71136 71177 76520 77425 78353 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 58 89 82 76 79 301 274 104 310 2 ...
Oct05Raw
'data.frame': 630900 obs. of 17 variables:
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : Factor w/ 266 levels "10A","10B","10E",..: 224 224 224 224 224 224 224 224 224 224 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Route_Direction : Factor w/ 11 levels "ANTICLKW","CLOCKWIS",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Stop_Sequence : int 1 1 3 4 5 4 3 3 7 7 ...
$ Stop_ID : Factor w/ 10543 levels "","1000001","1000003",..: 9659 9659 9632 9633 9624 9633 9632 9632 9673 9673 ...
$ Stop_Desc : Factor w/ 7725 levels "10TH ST + MICHIGAN AVE",..: 2946 2946 7403 7401 7401 7401 7403 7403 1346 1346 ...
$ Event_Type : int 3 5 4 4 3 4 3 3 5 5 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 1 2 3 3 1 3 1 1 2 2 ...
$ Event_Time : Factor w/ 77725 levels "10-5-16 1:00:00 AM",..: 49279 49371 49899 49953 49993 50135 50221 50493 50783 50987 ...
$ Departure_Time : Factor w/ 77716 levels "10-5-16 1:00:00 AM",..: 49257 49353 49877 49931 49997 50113 50421 50489 50767 50999 ...
$ Dwell_Time : int 189 2 0 0 13 0 111 9 3 17 ...
$ Delta_Time : int 4 78 -114 -3 19 93 297 191 382 499 ...
$ Odometer_Distance: int 37932 38703 44242 44327 44645 45927 46733 47077 50461 51916 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 316 104 276 246 229 246 345 207 109 305 ...
Oct06Raw
'data.frame': 621948 obs. of 17 variables:
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : Factor w/ 265 levels "10A","10B","10E",..: 224 224 224 224 224 224 224 224 224 224 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Route_Direction : Factor w/ 9 levels "ANTICLKW","CLOCKWIS",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Stop_Sequence : int 1 1 7 6 5 4 3 7 7 7 ...
$ Stop_ID : Factor w/ 10562 levels "","1000001","1000003",..: 9678 9678 9692 9693 9643 9652 9651 9692 9692 9692 ...
$ Stop_Desc : Factor w/ 7723 levels "10TH ST + MICHIGAN AVE",..: 2937 2937 1342 7400 7399 7399 7401 1342 1342 1342 ...
$ Event_Type : int 3 5 4 5 3 4 3 4 5 5 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 1 2 3 2 1 3 1 3 2 2 ...
$ Event_Time : Factor w/ 77758 levels "10-6-16 1:00:00 AM",..: 49294 49384 49982 49998 50058 50186 50270 50518 51002 51064 ...
$ Departure_Time : Factor w/ 77792 levels "10-6-16 1:00:00 AM",..: 49305 49399 49993 50023 50091 50197 50487 50529 51065 51079 ...
$ Dwell_Time : int 148 2 0 7 11 0 103 0 26 2 ...
$ Delta_Time : int -6 64 -87 -93 31 104 303 175 497 504 ...
$ Odometer_Distance: int 37950 38726 44130 44197 44592 45935 46739 51826 51826 51838 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 284 71 199 164 223 246 343 199 306 320 ...
Oct07Raw
'data.frame': 622894 obs. of 17 variables:
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : Factor w/ 266 levels "10A","10B","10E",..: 224 224 224 224 224 224 224 224 224 224 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 6 6 6 6 6 6 7 7 ...
$ Route_Direction : Factor w/ 7 levels "ANTICLKW","CLOCKWIS",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Stop_Sequence : int 1 1 2 3 3 5 6 7 1 2 ...
$ Stop_ID : Factor w/ 10556 levels "","1000001","1000003",..: 9672 9672 9800 9799 9799 9639 9640 9641 9641 9642 ...
$ Stop_Desc : Factor w/ 7699 levels "10TH ST + MICHIGAN AVE",..: 2930 2930 6886 6889 6889 4196 4199 6887 6887 4198 ...
$ Event_Type : int 3 4 3 4 5 4 4 3 3 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 1 3 1 3 2 3 3 1 1 3 ...
$ Event_Time : Factor w/ 77562 levels "10-7-16 1:00:00 AM",..: 49134 49136 51718 51756 51888 51934 52018 52044 52130 52288 ...
$ Departure_Time : Factor w/ 77649 levels "10-7-16 1:00:00 AM",..: 49193 49195 51779 51815 51953 51993 52077 52103 52189 52347 ...
$ Dwell_Time : int 153 0 1 0 3 0 0 120 120 0 ...
$ Delta_Time : int 57 56 165 270 197 201 181 189 235 288 ...
$ Odometer_Distance: int 37846 42154 56018 56611 57411 58341 59084 59787 59787 60252 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 98 89 201 104 26 2 1 247 247 182 ...
Put the daily data together.
AllDays <- bind_rows(list(Oct03Raw, Oct04Raw, Oct05Raw, Oct06Raw, Oct07Raw),
.id = c("group")
)
Unequal factor levels: coercing to characterUnequal factor levels: coercing to characterUnequal factor levels: coercing to characterUnequal factor levels: coercing to characterUnequal factor levels: coercing to characterUnequal factor levels: coercing to character
# dim(AllDays)
str(AllDays)
'data.frame': 3119443 obs. of 18 variables:
$ group : chr "1" "1" "1" "1" ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 6 6 ...
$ Route_Direction : chr "LOOP" "LOOP" "LOOP" "LOOP" ...
$ Stop_Sequence : int 7 7 6 3 2 8 1 1 2 3 ...
$ Stop_ID : chr "5004572" "5004572" "5004573" "5002210" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ Event_Type : int 4 5 4 4 4 3 3 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 2 3 3 3 1 1 3 3 3 ...
$ Event_Time : chr "10-3-16 6:06:47 AM" "10-3-16 6:07:50 AM" "10-3-16 6:09:47 AM" "10-3-16 6:10:24 AM" ...
$ Departure_Time : chr "10-3-16 6:06:47 AM" "10-3-16 6:08:01 AM" "10-3-16 6:09:47 AM" "10-3-16 6:10:24 AM" ...
$ Dwell_Time : int 0 11 0 0 0 0 104 0 0 0 ...
$ Delta_Time : int -177 -27 24 165 25 73 719 0 74 76 ...
$ Odometer_Distance: int 43543 43543 45139 46418 50115 51074 51303 53836 55633 56163 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 253 97 276 15 119 100 89 274 104 ...
Deleting old data frames.
for (i in 3:7){
rm(list = ls(pattern = paste0("Oct0", i, "Raw")
)
)
message("Deleting Oct0", i, "Raw")
}
Deleting Oct03Raw
Deleting Oct04Raw
Deleting Oct05Raw
Deleting Oct06Raw
Deleting Oct07Raw
Updating variable types.
Then, sorting the data and adding a RowNumber (to be used for identifying rows later in the analyses.)
rm(i)
AllDays$group <- factor(AllDays$group)
AllDays$Route_Direction <- factor(AllDays$Route_Direction)
AllDays$Event_Time <- as.POSIXct(AllDays$Event_Time, format = "%m-%d-%y %I:%M:%S %p")
AllDays$Departure_Time <- as.POSIXct(AllDays$Departure_Time, format = "%m-%d-%y %I:%M:%S %p")
str(AllDays)
'data.frame': 3119443 obs. of 18 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 7 6 3 2 8 1 1 2 3 ...
$ Stop_ID : chr "5004572" "5004572" "5004573" "5002210" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ Event_Type : int 4 5 4 4 4 3 3 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 2 3 3 3 1 1 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:07:50" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:08:01" ...
$ Dwell_Time : int 0 11 0 0 0 0 104 0 0 0 ...
$ Delta_Time : int -177 -27 24 165 25 73 719 0 74 76 ...
$ Odometer_Distance: int 43543 43543 45139 46418 50115 51074 51303 53836 55633 56163 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 253 97 276 15 119 100 89 274 104 ...
AllDays_Sorted <- arrange(AllDays,
Bus_ID,
Event_Time
) %>%
mutate(RowNum_OG = row_number() # this is useful in identify the row later on
)
rm(AllDays)
str(AllDays_Sorted)
'data.frame': 3119443 obs. of 19 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 7 6 3 2 8 1 1 2 3 ...
$ Stop_ID : chr "5004572" "5004572" "5004573" "5002210" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ Event_Type : int 4 5 4 4 4 3 3 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 2 3 3 3 1 1 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:07:50" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:08:01" ...
$ Dwell_Time : int 0 11 0 0 0 0 104 0 0 0 ...
$ Delta_Time : int -177 -27 24 165 25 73 719 0 74 76 ...
$ Odometer_Distance: int 43543 43543 45139 46418 50115 51074 51303 53836 55633 56163 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 253 97 276 15 119 100 89 274 104 ...
$ RowNum_OG : int 1 2 3 4 5 6 7 8 9 10 ...
# View(head(AllDays_Sorted, 100))
Inspecting the values of Stop_ID, and finding that it can take the values “” (blank) and “NULL”.
View(group_by(AllDays_Sorted,
Stop_ID
) %>%
summarise(
Cnt = n()
) %>%
arrange(Stop_ID)
)
View(filter(AllDays_Sorted,
is.na(Stop_ID) |
Stop_ID == "" |
Stop_ID == "NULL"
) %>%
arrange(Stop_Desc)
)
Creating a table of distinct Stop_Desc values when Stop_ID is “” (blank) or “NULL”.
StopID_New <- filter(AllDays_Sorted,
is.na(Stop_ID) |
Stop_ID == "" |
Stop_ID == "NULL"
) %>%
select(Stop_ID, Stop_Desc) %>%
distinct() %>%
arrange(Stop_ID, Stop_Desc) %>%
mutate(StopID_New = 1:nrow(.)
)
View(StopID_New)
Creating a full updated table by filling in StopID_New for when Stop_ID is “” (blank) or NULL.
AllDays_StopIDNew <- left_join(AllDays_Sorted,
select(StopID_New,
Stop_Desc,
StopID_New
),
by = c("Stop_Desc" = "Stop_Desc")
) %>%
mutate(StopID_Clean = ifelse(is.na(StopID_New),
Stop_ID,
StopID_New
),
StopID_Indicator = factor(ifelse(is.na(StopID_New),
"ID_OK",
"ID_Bad"
)
)
)
rm(StopID_New)
rm(AllDays_Sorted)
str(AllDays_StopIDNew)
'data.frame': 3119443 obs. of 22 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 1 1 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 7 6 3 2 8 1 1 2 3 ...
$ Stop_ID : chr "5004572" "5004572" "5004573" "5002210" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ Event_Type : int 4 5 4 4 4 3 3 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 2 3 3 3 1 1 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:07:50" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:08:01" ...
$ Dwell_Time : int 0 11 0 0 0 0 104 0 0 0 ...
$ Delta_Time : int -177 -27 24 165 25 73 719 0 74 76 ...
$ Odometer_Distance: int 43543 43543 45139 46418 50115 51074 51303 53836 55633 56163 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 253 97 276 15 119 100 89 274 104 ...
$ RowNum_OG : int 1 2 3 4 5 6 7 8 9 10 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004572" "5004573" "5002210" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
# View(tail(AllDays_StopIDNew, 500))
# View(filter(AllDays_StopIDNew,
# Stop_Desc == "METROWAY ANNNOUCEMNT CORR"
# )
# )
Lat Long stats for pulling in Zip codes later.
LL_Stats <- group_by(AllDays_StopIDNew,
StopID_Clean
) %>%
summarise(Lat_Mean = mean(Latitude, na.rm = TRUE),
Lat_Med = median(Latitude, na.rm = TRUE),
Lng_Mean = mean(Longitude, na.rm = TRUE),
Lng_Med = median(Longitude, na.rm = TRUE)
) %>%
mutate(Lat_MeaLessMed = Lat_Mean - Lat_Med,
Lng_MeaLessMed = Lng_Mean - Lng_Med,
RowNum = row_number()
)
str(LL_Stats)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10588 obs. of 8 variables:
$ StopID_Clean : chr "1" "10" "1000001" "1000003" ...
$ Lat_Mean : num 38.9 39 38.8 38.8 38.8 ...
$ Lat_Med : num 38.9 39 38.8 38.8 38.8 ...
$ Lng_Mean : num -77.1 -77.4 -77 -77 -77 ...
$ Lng_Med : num -77.1 -77.4 -77 -77 -77 ...
$ Lat_MeaLessMed: num 8.37e-05 0.00 -5.40e-06 2.15e-05 9.27e-06 ...
$ Lng_MeaLessMed: num -8.47e-05 0.00 1.07e-05 1.62e-04 -4.82e-07 ...
$ RowNum : int 1 2 3 4 5 6 7 8 9 10 ...
summary(LL_Stats)
StopID_Clean Lat_Mean Lat_Med Lng_Mean
Length:10588 Min. :38.09 Min. :38.60 Min. :-77.45
Class :character 1st Qu.:38.85 1st Qu.:38.85 1st Qu.:-77.10
Mode :character Median :38.90 Median :38.90 Median :-77.01
Mean :38.91 Mean :38.91 Mean :-77.03
3rd Qu.:38.96 3rd Qu.:38.96 3rd Qu.:-76.95
Max. :39.19 Max. :39.19 Max. :-75.42
Lng_Med Lat_MeaLessMed Lng_MeaLessMed RowNum
Min. :-77.45 Min. :-0.8069882 Min. :-0.0102476 Min. : 1
1st Qu.:-77.10 1st Qu.:-0.0000126 1st Qu.:-0.0000167 1st Qu.: 2648
Median :-77.01 Median : 0.0000003 Median :-0.0000003 Median : 5294
Mean :-77.03 Mean :-0.0001867 Mean : 0.0003654 Mean : 5294
3rd Qu.:-76.95 3rd Qu.: 0.0000141 3rd Qu.: 0.0000148 3rd Qu.: 7941
Max. :-76.67 Max. : 0.0093569 Max. : 1.6050662 Max. :10588
View(head(arrange(LL_Stats,
Lat_MeaLessMed
),
500
)
)
View(head(arrange(LL_Stats,
desc(Lat_MeaLessMed)
),
500
)
)
View(head(arrange(LL_Stats,
Lng_MeaLessMed
),
500
)
)
View(head(arrange(LL_Stats,
desc(Lng_MeaLessMed)
),
500
)
)
Pulling in Zip Code data.
LL_StatsZips <- left_join(LL_Stats_UnqLatLng,
Zips_All,
by = c("UniqueLatLng" = "UniqueLatLng")
)
str(LL_StatsZips)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10588 obs. of 20 variables:
$ StopID_Clean : chr "1" "10" "1000001" "1000003" ...
$ Lat_Mean : num 38.9 39 38.8 38.8 38.8 ...
$ Lat_Med : num 38.9 39 38.8 38.8 38.8 ...
$ Lng_Mean : num -77.1 -77.4 -77 -77 -77 ...
$ Lng_Med : num -77.1 -77.4 -77 -77 -77 ...
$ Lat_MeaLessMed: num 8.37e-05 0.00 -5.40e-06 2.15e-05 9.27e-06 ...
$ Lng_MeaLessMed: num -8.47e-05 0.00 1.07e-05 1.62e-04 -4.82e-07 ...
$ RowNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ UniqueLatLng : chr "38.85738---77.055138" "38.962704---77.433685" "38.816044---77.017685" "38.816292---77.018036" ...
$ id : chr "1" "1" "1" "1" ...
$ adminCode2 : chr "013" "107" "001" "001" ...
$ adminCode1 : chr "VA" "VA" "DC" "DC" ...
$ adminName2 : chr "Arlington" "Loudoun" "District of Columbia" "District of Columbia" ...
$ lng : num -77.1 -77.4 -77 -77 -77 ...
$ distance : chr "0" "0" "0" "0" ...
$ countryCode : chr "US" "US" "US" "US" ...
$ postalCode : chr "22202" "20166" "20032" "20032" ...
$ adminName1 : chr "Virginia" "Virginia" "District of Columbia" "District of Columbia" ...
$ placeName : chr "Arlington" "Sterling" "Washington" "Washington" ...
$ lat : num 38.9 39 38.8 38.8 38.8 ...
View(LL_StatsZips)
Feature engineering.
Inspecting incidences of consecutive Stop_IDs. This is done because investigation showed that many conseutive events occurr at the same Stop_ID, but with various Dwell_Times, Odometer_Distances, etc. All of which affect calculations and analyses.
Create data on the runs (consecutive Stop_IDs).
StopID_Runs <- rle(AllDays_StopIDNew$StopID_Clean)
StopID_Runs$ends <- cumsum(StopID_Runs$lengths)
StopID_Runs$starts <- ifelse(is.na(lag(StopID_Runs$ends)
),
1,
lag(StopID_Runs$ends) + 1
)
str(StopID_Runs)
List of 4
$ lengths: int [1:2809529] 2 1 1 1 1 2 1 1 1 1 ...
$ values : chr [1:2809529] "5004572" "5004573" "5002210" "5002209" ...
$ ends : int [1:2809529] 2 3 4 5 6 8 9 10 11 12 ...
$ starts : num [1:2809529] 1 3 4 5 6 7 9 10 11 12 ...
- attr(*, "class")= chr "rle"
# class(StopID_Runs)
#
# StopID_Runs_df <- data.frame(unclass(StopID_Runs))
# str(StopID_Runs_df)
# class(StopID_Runs_df)
# rm(StopID_Runs_df)
Trying to link data on RunsGroups with the original data (AllDays_Sorted). The goal is to select only one record per RunsGroup - that being the record with the longest Dwell_Time.
I attempted this computation using both data.frames (dplyr) and data.tables (data.table). However, with 2,809,062 rows in one dataset and 3,119,443 rows in the other dataset, the current computation time is over 5 days…so I’m trying a different strategy to only select the first record in a run.
# Create a RunsGroup variable for each run
# StopID_Runs_df$RunsGroup <- paste0("g", seq(1:nrow(StopID_Runs_df)
# )
# )
#
# str(StopID_Runs_df)
# head(StopID_Runs_df, 25)
# tail(StopID_Runs_df, 25)
#
# StopID_Runs_df <- StopID_Runs_df %>%
# mutate(RowNum = row_number()
# )
#
# str(StopID_Runs_df)
# head(StopID_Runs_df, 25)
# tail(StopID_Runs_df, 25)
#
#
# # Converting to data.tables for, hopefully, improved performance (speed) in computation
# StopID_Runs_dt <- data.table(StopID_Runs_df)
# setkey(StopID_Runs_dt, RowNum)
# str(StopID_Runs_dt)
#
# AllDays_Sorted_dt <- data.table(AllDays_Sorted)
# setkey(AllDays_Sorted_dt, RowNum_OG)
# str(AllDays_Sorted_dt)
# # rm(AllDays_Sorted_dt)
#
#
# # Actual loop to perform the computations and link to original data (AllDays_Sorted_dt)
# GroupData <- list()
# for(i in 1:nrow(StopID_Runs_dt)
# ) {
# assign(paste0("group_", i),
# StopID_Runs_dt[RowNum == i, RunsGroup]
# )
#
# ##### The code below is the same code as above, but done with dplyr #####
#
# # assign(paste0("group_", i),
# # filter(StopID_Runs_df,
# # RowNum == i
# # ) %>%
# # select(RunsGroup)
# # )
#
# assign(paste0("group_", i, "_start"),
# StopID_Runs_dt[RowNum == i, starts]
# )
#
# assign(paste0("group_", i, "_end"),
# StopID_Runs_dt[RowNum == i, ends]
# )
#
# assign(paste0("group_", i, "_rows"),
# AllDays_Sorted_dt[RowNum_OG >= as.numeric(get(paste0("group_", i, "_start")
# )
# ) &
# RowNum_OG <= as.numeric(get(paste0("group_", i, "_end")
# )
# ),
# RunsGroup := as.character(get(paste0("group_", i)
# )
# )
# ]
#
# ##### The code below is the same as the code above, but done with dplyr #####
#
# # filter(AllDays_Sorted,
# # between(RowNum_OG,
# # as.numeric(get(paste0("group_", i, "_start")
# # )
# # ),
# # as.numeric(get(paste0("group_", i, "_end")
# # )
# # )
# # )
# # ) %>%
# # mutate(RunsGroup = as.character(get(paste0("group_", i)
# # )
# # )
# # )
# )
#
# GroupData[[i]] <- get(paste0("group_", i, "_rows"))
#
# message("Processing Group ", i, " of 2,809,062")
# }
#
#
# GroupData_df <- rbind.fill(GroupData)
# str(GroupData_df)
# head(GroupData_df)
# tail(GroupData_df)
# # rm(GroupData_df)
#
#
# group_1
# group_1_start
# group_1_end
# group_1_rows
# group_2_rows
# group_3_rows
# group_50_rows
# str(group_50_rows)
# group_2809062_rows
# GroupData[[1]]
# GroupData[[50]]
#
#
# ##### Testing Area (Below) #####
# ##### Testing Area (Below) #####
# ##### Testing Area (Below) #####
#
# # head(StopID_Runs$starts, 20)
# # head(AllDays_NewOrder$Stop_ID, 20)
# #
# #
# # dat <- as.data.frame(c(1,1,7,7,7,9,6,8,2,2,2,1,1,1,1,1))
# # colnames(dat)[1] <- "dat"
# # r <- rle(dat$dat)
# # dat$run <- rep(r$lengths, r$lengths)
# # dat$runLag <- lag(dat$run)
# # dat$cond <- rep(r$values, r$lengths)
# # dat
# # View(dat)
When consecutive Stop_ID occurrs, only take the first occurrence. This is done because the computation time to select only the record with the longest Dwell_Time for each run was too long (over 5 days).
This is probably less than ideal with regards to Dwell_Time, but should not make much difference for calculations of travel time, speed, etc.
AllDays_FirstStopID <- AllDays_StopIDNew[StopID_Runs$starts, ]
dim(AllDays_StopIDNew)
[1] 3119443 22
dim(AllDays_FirstStopID)
[1] 2809529 22
nrow(AllDays_StopIDNew) - nrow(AllDays_FirstStopID)
[1] 309914
rm(AllDays_StopIDNew)
rm(StopID_Runs)
str(AllDays_FirstStopID)
'data.frame': 2809529 obs. of 22 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description: Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance: int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
Feature engineering.
Creating new variables.
AllDays_AddVars <- mutate(AllDays_FirstStopID,
Odometer_Distance_Mi = Odometer_Distance / 5280, #5,280 feet in 1 mile
Dwell_Time2 = as.numeric(Departure_Time - Event_Time),
Event_Time_Yr = as.integer(year(Event_Time)),
Event_Time_Mth = as.integer(month(Event_Time)),
Event_Time_Date = day(Event_Time),
Event_Time_Day = wday(Event_Time, label = TRUE),
Event_Time_Hr = hour(Event_Time),
Event_Time_Min = minute(Event_Time),
Event_Time_HrGroup = factor(ifelse(Event_Time_Hr < 3,
"Group0_2",
ifelse(Event_Time_Hr < 6,
"Group3_5",
ifelse(Event_Time_Hr < 9,
"Group6_8",
ifelse(Event_Time_Hr < 12,
"Group9_11",
ifelse(Event_Time_Hr < 15,
"Group12_14",
ifelse(Event_Time_Hr < 18,
"Group15_17",
ifelse(Event_Time_Hr < 21,
"Group18_20",
ifelse(Event_Time_Hr < 24,
"Group21_23"
)))))))),
levels = c("Group0_2",
"Group3_5",
"Group6_8",
"Group9_11",
"Group12_14",
"Group15_17",
"Group18_20",
"Group21_23"
),
ordered = TRUE
)
)
rm(AllDays_FirstStopID)
str(AllDays_AddVars)
'data.frame': 2809529 obs. of 31 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi: num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
# group_by(AllDays_AddVars,
# Event_Time_HrGroup
# ) %>%
# summarise(Cnts = n()
# )
# View(head(filter(AllDays_AddVars,
# Event_Time_Hr == 0
# ),
# 50
# )
# )
# View(head(AllDays_AddVars, 50))
Feature engineering.
Creating more variables. Creating a BusEvent row number for future identification purposes. Then, creating various variables to analyze distance traveled and speed.
AllDays_BusDay <- group_by(AllDays_AddVars,
Bus_ID,
Event_Time_Date
) %>%
mutate(BusDay_EventNum = row_number(), # used to identify Bus movements on a particular date
Route_Lag1 = lag(Route), # used in future analyses to identify Route changes
RouteAlt_Lag1 = lag(RouteAlt), # used in future analyses to identify RouteAlt (direction) changes
Odometer_Distance_Lag1 = lag(Odometer_Distance),
Latitude_L1 = lag(Latitude),
Longitude_L1 = lag(Longitude),
# Lat_Radian = Latitude*pi/180,
# Long_Radian = Longitude*pi/180,
# Lat_Radian_L1 = lag(Lat_Radian),
# Long_Radian_L1 = lag(Long_Radian),
# accounting for potential negative distances
TravelDistance_Ft = ifelse(Odometer_Distance > Odometer_Distance_Lag1,
Odometer_Distance - Odometer_Distance_Lag1,
NA
),
TravelDistance_Mi = TravelDistance_Ft / 5280, #5,280 feet in 1 mile
# TravelDistance_Mi2 = gcd.hf(long1 = Long_Radian_L1,
# lat1 = Lat_Radian_L1,
# long2 = Long_Radian,
# lat2 = Lat_Radian
# ),
TravelDistance_Mi_Hvrs =
# ifelse((is.na(Longitude_L1) | is.na(Latitude_L1)
# ),
# NA,
distHaversine(cbind(Longitude_L1, Latitude_L1),
cbind(Longitude, Latitude)
) * 0.000621371, # 0.000621371 miles = 1 meter
# accounting for potential negative times
TravelTime_Sec = as.numeric(ifelse(Event_Time > lag(Departure_Time),
Event_Time - lag(Departure_Time),
NA
)
),
TravelTime_Hr = TravelTime_Sec / 3600, # 3,600 seconds in 1 hour
# accounting for potential negative or zero travel times
SpeedAvg_Mph = ifelse(TravelTime_Hr > 0,
TravelDistance_Mi / TravelTime_Hr,
NA
),
Start_ID = lag(StopID_Clean),
Start_Desc = lag(Stop_Desc),
StartStop_ID = ifelse(is.na(Start_ID),
paste("NULL", StopID_Clean, sep = "--"),
paste(Start_ID, StopID_Clean, sep = "--")
)
) %>%
as.data.frame()
rm(AllDays_AddVars)
str(AllDays_BusDay)
'data.frame': 2809529 obs. of 46 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Route_Lag1 : chr NA "S80" "S80" "S80" ...
$ RouteAlt_Lag1 : Factor w/ 14 levels "1","10","11",..: NA 1 1 1 1 1 1 6 6 6 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Latitude_L1 : num NA 38.8 38.8 38.8 38.8 ...
$ Longitude_L1 : num NA -77.2 -77.2 -77.2 -77.2 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ SpeedAvg_Mph : num NA 6.05 23.57 100.83 3.44 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
# summary(AllDays_BusDay)
# View(tail(AllDays_BusDay, 50))
Inspecting for issues with StartStop_ID (where the value is either NA or contains NULL). They ONLY exist when BusDay_EventNum = 1 (which is by design). So everything looks OK.
View(group_by(AllDays_BusDay,
StartStop_ID
) %>%
summarise(
Cnt = n()
) %>%
arrange(desc(Cnt)
)
)
View(filter(AllDays_BusDay,
(is.na(StartStop_ID) |
str_detect(StartStop_ID, "NULL")
) &
BusDay_EventNum != 1
)
)
Stats (quantiles) overall for TravelDistance_Mi.
Quantiles_dt <- AllDays_BusDay %>%
mutate(TD_Mi_q2 = quantile(x = TravelDistance_Mi, probs = 0.02, na.rm = TRUE),
TD_Mi_q98 = quantile(x = TravelDistance_Mi, probs = 0.98, na.rm = TRUE),
TT_Sec_q2 = quantile(x = TravelTime_Sec, probs = 0.02, na.rm = TRUE),
TT_Sec_q98 = quantile(x = TravelTime_Sec, probs = 0.98, na.rm = TRUE),
TT_Hr_q2 = quantile(x = TravelTime_Hr, probs = 0.02, na.rm = TRUE),
TT_Hr_q98 = quantile(x = TravelTime_Hr, probs = 0.98, na.rm = TRUE)
) %>%
data.table()
Stats <- Quantiles_dt %>%
mutate(TD_Mi_Mean = mean(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_Mean_F = mean(TravelDistance_Mi[TD_Mi_q2 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_q98],
na.rm = TRUE
),
TD_Mi_Med = median(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_Med_F = median(TravelDistance_Mi[TD_Mi_q2 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_q98],
na.rm = TRUE
),
TD_Mi_Cnt = sum(!is.na(TravelDistance_Mi)
),
TD_Mi_Cnt_F = sum(!is.na(TravelDistance_Mi[TD_Mi_q2 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_q98]
)
),
TT_Sec_Mean = mean(TravelTime_Sec, na.rm = TRUE),
TT_Sec_Mean_F = mean(TravelTime_Sec[TT_Sec_q2 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_q98],
na.rm = TRUE
),
TT_Sec_Med = median(TravelTime_Sec, na.rm = TRUE),
TT_Sec_Med_F = median(TravelTime_Sec[TT_Sec_q2 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_q98],
na.rm = TRUE
),
TT_Sec_Cnt = sum(!is.na(TravelTime_Sec)
),
TT_Sec_Cnt_F = sum(!is.na(TravelTime_Sec[TT_Sec_q2 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_q98]
)
),
TT_Hr_Mean = mean(TravelTime_Hr, na.rm = TRUE),
TT_Hr_Mean_F = mean(TravelTime_Hr[TT_Hr_q2 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_q98],
na.rm = TRUE
),
TT_Hr_Med = median(TravelTime_Hr, na.rm = TRUE),
TT_Hr_Med_F = median(TravelTime_Hr[TT_Hr_q2 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_q98],
na.rm = TRUE
),
TT_Hr_Cnt = sum(!is.na(TravelTime_Hr)
),
TT_Hr_Cnt_F = sum(!is.na(TravelTime_Hr[TT_Hr_q2 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_q98]
)
)
) %>%
data.frame()
rm(AllDays_BusDay)
rm(Quantiles_dt)
str(Stats)
'data.frame': 2809529 obs. of 70 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Route_Lag1 : chr NA "S80" "S80" "S80" ...
$ RouteAlt_Lag1 : Factor w/ 14 levels "1","10","11",..: NA 1 1 1 1 1 1 6 6 6 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Latitude_L1 : num NA 38.8 38.8 38.8 38.8 ...
$ Longitude_L1 : num NA -77.2 -77.2 -77.2 -77.2 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ SpeedAvg_Mph : num NA 6.05 23.57 100.83 3.44 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_Med : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Med_F : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Hr_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
# View(head(Stats, 50))
Stats for StartStop_ID.
Quantiles_SS_dt <- group_by(Stats,
StartStop_ID
) %>%
mutate(TD_Mi_SS_q5 = quantile(x = TravelDistance_Mi, probs = 0.05, na.rm = TRUE),
TD_Mi_SS_q95 = quantile(x = TravelDistance_Mi, probs = 0.95, na.rm = TRUE),
TT_Sec_SS_q5 = quantile(x = TravelTime_Sec, probs = 0.05, na.rm = TRUE),
TT_Sec_SS_q95 = quantile(x = TravelTime_Sec, probs = 0.95, na.rm = TRUE),
TT_Hr_SS_q5 = quantile(x = TravelTime_Hr, probs = 0.05, na.rm = TRUE),
TT_Hr_SS_q95 = quantile(x = TravelTime_Hr, probs = 0.95, na.rm = TRUE)
) %>%
data.table()
Stats_StSt <- group_by(Quantiles_SS_dt,
StartStop_ID
) %>%
mutate(TD_Mi_SS_Mean = mean(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_SS_Mean_F = mean(TravelDistance_Mi[TD_Mi_SS_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SS_q95],
na.rm = TRUE
),
TD_Mi_SS_Med = median(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_SS_Med_F = median(TravelDistance_Mi[TD_Mi_SS_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SS_q95],
na.rm = TRUE
),
TD_Mi_SS_Cnt = sum(!is.na(TravelDistance_Mi)
),
TD_Mi_SS_Cnt_F = sum(!is.na(TravelDistance_Mi[TD_Mi_SS_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SS_q95]
)
),
TT_Sec_SS_Mean = mean(TravelTime_Sec, na.rm = TRUE),
TT_Sec_SS_Mean_F = mean(TravelTime_Sec[TT_Sec_SS_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SS_q95],
na.rm = TRUE
),
TT_Sec_SS_Med = median(TravelTime_Sec, na.rm = TRUE),
TT_Sec_SS_Med_F = median(TravelTime_Sec[TT_Sec_SS_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SS_q95],
na.rm = TRUE
),
TT_Sec_SS_Cnt = sum(!is.na(TravelTime_Sec)),
TT_Sec_SS_Cnt_F = sum(!is.na(TravelTime_Sec[TT_Sec_SS_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SS_q95]
)
),
TT_Hr_SS_Mean = mean(TravelTime_Hr, na.rm = TRUE),
TT_Hr_SS_Mean_F = mean(TravelTime_Hr[TT_Hr_SS_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SS_q95],
na.rm = TRUE
),
TT_Hr_SS_Med = median(TravelTime_Hr, na.rm = TRUE),
TT_Hr_SS_Med_F = median(TravelTime_Hr[TT_Hr_SS_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SS_q95],
na.rm = TRUE
),
TT_Hr_SS_Cnt = sum(!is.na(TravelTime_Hr)),
TT_Hr_SS_Cnt_F = sum(!is.na(TravelTime_Hr[TT_Hr_SS_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SS_q95]
)
)
) %>%
data.frame()
rm(Stats)
rm(Quantiles_SS_dt)
str(Stats_StSt)
'data.frame': 2809529 obs. of 94 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Route_Lag1 : chr NA "S80" "S80" "S80" ...
$ RouteAlt_Lag1 : Factor w/ 14 levels "1","10","11",..: NA 1 1 1 1 1 1 6 6 6 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Latitude_L1 : num NA 38.8 38.8 38.8 38.8 ...
$ Longitude_L1 : num NA -77.2 -77.2 -77.2 -77.2 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ SpeedAvg_Mph : num NA 6.05 23.57 100.83 3.44 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_Med : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Med_F : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Hr_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TD_Mi_SS_q5 : num NA 0.0252 0.2422 0.7324 0.0794 ...
$ TD_Mi_SS_q95 : num NA 0.626 0.242 1.008 0.176 ...
$ TT_Sec_SS_q5 : num NA 11.9 37 30.5 172.9 ...
$ TT_Sec_SS_q95 : num NA 346.3 37 75.8 189.1 ...
$ TT_Hr_SS_q5 : num NA 0.00331 0.01028 0.00849 0.04803 ...
$ TT_Hr_SS_q95 : num NA 0.0962 0.0103 0.0211 0.0525 ...
$ TD_Mi_SS_Mean : num NaN 0.437 0.242 0.908 0.128 ...
$ TD_Mi_SS_Mean_F : num NaN 0.457 0.242 0.977 NaN ...
$ TD_Mi_SS_Med : num NA 0.512 0.242 0.962 0.128 ...
$ TD_Mi_SS_Med_F : num NA 0.512 0.242 1.008 NA ...
$ TD_Mi_SS_Cnt : int 0 14 1 4 2 87 22 118 91 11 ...
$ TD_Mi_SS_Cnt_F : int 0 12 1 3 0 77 18 106 81 9 ...
$ TT_Sec_SS_Mean : num NaN 215.8 37 58.2 181 ...
$ TT_Sec_SS_Mean_F : num NaN 218.9 37 65.5 NaN ...
$ TT_Sec_SS_Med : num NA 223.5 37 65.5 181 ...
$ TT_Sec_SS_Med_F : num NA 223.5 37 65.5 NA ...
$ TT_Sec_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Sec_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TT_Hr_SS_Mean : num NaN 0.0599 0.0103 0.0162 0.0503 ...
$ TT_Hr_SS_Mean_F : num NaN 0.0608 0.0103 0.0182 NaN ...
$ TT_Hr_SS_Med : num NA 0.0621 0.0103 0.0182 0.0503 ...
$ TT_Hr_SS_Med_F : num NA 0.0621 0.0103 0.0182 NA ...
$ TT_Hr_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Hr_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
# View(head(Stats_StSt, 50))
Stats for StartStop_ID with Event_Time_HrGroup.
Quantiles_SSHG_dt <- group_by(Stats_StSt,
StartStop_ID,
Event_Time_HrGroup
) %>%
mutate(TD_Mi_SSHG_q5 = quantile(x = TravelDistance_Mi, probs = 0.05, na.rm = TRUE),
TD_Mi_SSHG_q95 = quantile(x = TravelDistance_Mi, probs = 0.95, na.rm = TRUE),
TT_Sec_SSHG_q5 = quantile(x = TravelTime_Sec, probs = 0.05, na.rm = TRUE),
TT_Sec_SSHG_q95 = quantile(x = TravelTime_Sec, probs = 0.95, na.rm = TRUE),
TT_Hr_SSHG_q5 = quantile(x = TravelTime_Hr, probs = 0.05, na.rm = TRUE),
TT_Hr_SSHG_q95 = quantile(x = TravelTime_Hr, probs = 0.95, na.rm = TRUE)
) %>%
data.table()
Stats_StSt_HrGrp <- group_by(Quantiles_SSHG_dt,
StartStop_ID,
Event_Time_HrGroup
) %>%
mutate(TD_Mi_SSHG_Mean = mean(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_SSHG_Mean_F = mean(TravelDistance_Mi[TD_Mi_SSHG_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SSHG_q95],
na.rm = TRUE
),
TD_Mi_SSHG_Med = median(TravelDistance_Mi, na.rm = TRUE),
TD_Mi_SSHG_Med_F = median(TravelDistance_Mi[TD_Mi_SSHG_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SSHG_q95],
na.rm = TRUE
),
TD_Mi_SSHG_Cnt = sum(!is.na(TravelDistance_Mi)
),
TD_Mi_SSHG_Cnt_F = sum(!is.na(TravelDistance_Mi[TD_Mi_SSHG_q5 <= TravelDistance_Mi & TravelDistance_Mi <= TD_Mi_SSHG_q95]
)
),
TT_Sec_SSHG_Mean = mean(TravelTime_Sec, na.rm = TRUE),
TT_Sec_SSHG_Mean_F = mean(TravelTime_Sec[TT_Sec_SSHG_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SSHG_q95],
na.rm = TRUE
),
TT_Sec_SSHG_Med = median(TravelTime_Sec, na.rm = TRUE),
TT_Sec_SSHG_Med_F = median(TravelTime_Sec[TT_Sec_SSHG_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SSHG_q95],
na.rm = TRUE
),
TT_Sec_SSHG_Cnt = sum(!is.na(TravelTime_Sec)),
TT_Sec_SSHG_Cnt_F = sum(!is.na(TravelTime_Sec[TT_Sec_SSHG_q5 <= TravelTime_Sec & TravelTime_Sec <= TT_Sec_SSHG_q95]
)
),
TT_Hr_SSHG_Mean = mean(TravelTime_Hr, na.rm = TRUE),
TT_Hr_SSHG_Mean_F = mean(TravelTime_Hr[TT_Hr_SSHG_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SSHG_q95],
na.rm = TRUE
),
TT_Hr_SSHG_Med = median(TravelTime_Hr, na.rm = TRUE),
TT_Hr_SSHG_Med_F = median(TravelTime_Hr[TT_Hr_SSHG_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SSHG_q95],
na.rm = TRUE
),
TT_Hr_SSHG_Cnt = sum(!is.na(TravelTime_Hr)),
TT_Hr_SSHG_Cnt_F = sum(!is.na(TravelTime_Hr[TT_Hr_SSHG_q5 <= TravelTime_Hr & TravelTime_Hr <= TT_Hr_SSHG_q95]
)
)
) %>%
data.frame()
rm(Stats_StSt)
rm(Quantiles_SSHG_dt)
str(Stats_StSt_HrGrp)
'data.frame': 2809529 obs. of 118 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Route_Lag1 : chr NA "S80" "S80" "S80" ...
$ RouteAlt_Lag1 : Factor w/ 14 levels "1","10","11",..: NA 1 1 1 1 1 1 6 6 6 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Latitude_L1 : num NA 38.8 38.8 38.8 38.8 ...
$ Longitude_L1 : num NA -77.2 -77.2 -77.2 -77.2 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ SpeedAvg_Mph : num NA 6.05 23.57 100.83 3.44 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_Med : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Med_F : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Hr_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TD_Mi_SS_q5 : num NA 0.0252 0.2422 0.7324 0.0794 ...
$ TD_Mi_SS_q95 : num NA 0.626 0.242 1.008 0.176 ...
$ TT_Sec_SS_q5 : num NA 11.9 37 30.5 172.9 ...
$ TT_Sec_SS_q95 : num NA 346.3 37 75.8 189.1 ...
$ TT_Hr_SS_q5 : num NA 0.00331 0.01028 0.00849 0.04803 ...
$ TT_Hr_SS_q95 : num NA 0.0962 0.0103 0.0211 0.0525 ...
$ TD_Mi_SS_Mean : num NaN 0.437 0.242 0.908 0.128 ...
$ TD_Mi_SS_Mean_F : num NaN 0.457 0.242 0.977 NaN ...
$ TD_Mi_SS_Med : num NA 0.512 0.242 0.962 0.128 ...
$ TD_Mi_SS_Med_F : num NA 0.512 0.242 1.008 NA ...
$ TD_Mi_SS_Cnt : int 0 14 1 4 2 87 22 118 91 11 ...
$ TD_Mi_SS_Cnt_F : int 0 12 1 3 0 77 18 106 81 9 ...
$ TT_Sec_SS_Mean : num NaN 215.8 37 58.2 181 ...
$ TT_Sec_SS_Mean_F : num NaN 218.9 37 65.5 NaN ...
$ TT_Sec_SS_Med : num NA 223.5 37 65.5 181 ...
$ TT_Sec_SS_Med_F : num NA 223.5 37 65.5 NA ...
$ TT_Sec_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Sec_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TT_Hr_SS_Mean : num NaN 0.0599 0.0103 0.0162 0.0503 ...
$ TT_Hr_SS_Mean_F : num NaN 0.0608 0.0103 0.0182 NaN ...
$ TT_Hr_SS_Med : num NA 0.0621 0.0103 0.0182 0.0503 ...
$ TT_Hr_SS_Med_F : num NA 0.0621 0.0103 0.0182 NA ...
$ TT_Hr_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Hr_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TD_Mi_SSHG_q5 : num NA 0.0996 0.2422 0.7002 0.1816 ...
$ TD_Mi_SSHG_q95 : num NA 0.627 0.242 0.7 0.182 ...
$ TT_Sec_SSHG_q5 : num NA 59.6 37 25 190 11.6 236 51.5 55 8.8 ...
$ TT_Sec_SSHG_q95 : num NA 276 37 25 190 ...
$ TT_Hr_SSHG_q5 : num NA 0.01656 0.01028 0.00694 0.05278 ...
[list output truncated]
# View(head(Stats_StSt_HrGrp, 50))
Feature engineering.
Creating a BusEventRoute row number, and a RouteAlt_Lag1 indicator for future identification purposes.
# rm(Quantiles_dt)
# rm(Quantiles_SS_dt)
# rm(AllDays_BusDay)
# rm(Quantiles_SSHG_dt)
# rm(Stats_StSt)
# AllDays_BusDayRoute <- group_by(Stats_StSt_HrGrp,
# Bus_ID,
# Event_Time_Date,
# Route
# ) %>%
# mutate(RouteAlt_Lag2 = lag(RouteAlt) # used in future analyses to identify RouteAlt (direction) changes
#
# # Odometer_Distance_Lag1 = lag(Odometer_Distance),
# #
# # # accounting for potential negative distances
# # TravelDistance_Ft = ifelse(Odometer_Distance >= Odometer_Distance_Lag1,
# # Odometer_Distance - Odometer_Distance_Lag1,
# # NA
# # ),
# # TravelDistance_Mi = TravelDistance_Ft / 5280, #5,280 feet in 1 mile
# #
# # # accounting for potential negative times
# # TravelTime_Sec = as.numeric(ifelse(Event_Time >= lag(Departure_Time),
# # Event_Time - lag(Departure_Time),
# # NA
# # )
# # ),
# # TravelTime_Hr = TravelTime_Sec / 3600, # 3,600 seconds in 1 hour
# #
# # # accounting for potential negative or zero travel times
# # SpeedAvg_Mph = ifelse(TravelTime_Hr > 0,
# # TravelDistance_Mi / TravelTime_Hr,
# # NA
# # )
# ) %>%
# data.frame()
#
# rm(Stats_StSt_HrGrp)
# str(AllDays_BusDayRoute)
Feature engineering.
Calculating a variable to know if the RouteAlt changed. Could be useful in helping identifying weirdness in calculated distances and speeds.
# rm(Stats_StSt_HrGrp)
AllDays_DirChange <- Stats_StSt_HrGrp %>% # AllDays_BusDayRoute %>%
mutate(RteChange = ifelse(Route == Route_Lag1,
"Same",
"Change"
),
RteChange2 = factor(ifelse(is.na(RteChange),
"Change",
RteChange
)
),
DirChange = ifelse(RouteAlt == RouteAlt_Lag1,
"Same",
"Change"
),
DirChange2 = factor(ifelse(is.na(DirChange),
"Change",
DirChange
)
)
)
# rm(AllDays_BusDayRoute)
rm(Stats_StSt_HrGrp)
str(AllDays_DirChange)
'data.frame': 2809529 obs. of 122 variables:
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Stop_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ StopID_New : int NA NA NA NA NA NA NA NA NA NA ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Route_Lag1 : chr NA "S80" "S80" "S80" ...
$ RouteAlt_Lag1 : Factor w/ 14 levels "1","10","11",..: NA 1 1 1 1 1 1 6 6 6 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Latitude_L1 : num NA 38.8 38.8 38.8 38.8 ...
$ Longitude_L1 : num NA -77.2 -77.2 -77.2 -77.2 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ SpeedAvg_Mph : num NA 6.05 23.57 100.83 3.44 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_Med : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Med_F : num 0.0108 0.0108 0.0108 0.0108 0.0108 ...
$ TT_Hr_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Hr_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TD_Mi_SS_q5 : num NA 0.0252 0.2422 0.7324 0.0794 ...
$ TD_Mi_SS_q95 : num NA 0.626 0.242 1.008 0.176 ...
$ TT_Sec_SS_q5 : num NA 11.9 37 30.5 172.9 ...
$ TT_Sec_SS_q95 : num NA 346.3 37 75.8 189.1 ...
$ TT_Hr_SS_q5 : num NA 0.00331 0.01028 0.00849 0.04803 ...
$ TT_Hr_SS_q95 : num NA 0.0962 0.0103 0.0211 0.0525 ...
$ TD_Mi_SS_Mean : num NaN 0.437 0.242 0.908 0.128 ...
$ TD_Mi_SS_Mean_F : num NaN 0.457 0.242 0.977 NaN ...
$ TD_Mi_SS_Med : num NA 0.512 0.242 0.962 0.128 ...
$ TD_Mi_SS_Med_F : num NA 0.512 0.242 1.008 NA ...
$ TD_Mi_SS_Cnt : int 0 14 1 4 2 87 22 118 91 11 ...
$ TD_Mi_SS_Cnt_F : int 0 12 1 3 0 77 18 106 81 9 ...
$ TT_Sec_SS_Mean : num NaN 215.8 37 58.2 181 ...
$ TT_Sec_SS_Mean_F : num NaN 218.9 37 65.5 NaN ...
$ TT_Sec_SS_Med : num NA 223.5 37 65.5 181 ...
$ TT_Sec_SS_Med_F : num NA 223.5 37 65.5 NA ...
$ TT_Sec_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Sec_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TT_Hr_SS_Mean : num NaN 0.0599 0.0103 0.0162 0.0503 ...
$ TT_Hr_SS_Mean_F : num NaN 0.0608 0.0103 0.0182 NaN ...
$ TT_Hr_SS_Med : num NA 0.0621 0.0103 0.0182 0.0503 ...
$ TT_Hr_SS_Med_F : num NA 0.0621 0.0103 0.0182 NA ...
$ TT_Hr_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Hr_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TD_Mi_SSHG_q5 : num NA 0.0996 0.2422 0.7002 0.1816 ...
$ TD_Mi_SSHG_q95 : num NA 0.627 0.242 0.7 0.182 ...
$ TT_Sec_SSHG_q5 : num NA 59.6 37 25 190 11.6 236 51.5 55 8.8 ...
$ TT_Sec_SSHG_q95 : num NA 276 37 25 190 ...
$ TT_Hr_SSHG_q5 : num NA 0.01656 0.01028 0.00694 0.05278 ...
[list output truncated]
View(filter(AllDays_DirChange,
between(RowNum_OG, 2570060, 2570080)
) %>%
select(-matches("(q(2|5|(95)|(98)))|Mean|Med|Cnt")
)
)
Re-ordering the variables to ease with comprehension.
AllDays_NewOrder <- select(AllDays_DirChange,
RowNum_OG,
group,
StartStop_ID,
BusDay_EventNum,
Bus_ID,
Route,
RteChange2,
RouteAlt,
# RouteAlt_Lag1,
DirChange2,
Route_Direction,
Stop_Sequence,
Start_ID,
Start_Desc,
# Stop_ID,
StopID_Clean,
StopID_Indicator,
Stop_Desc,
Event_Type,
Event_Description,
Event_Time_Yr,
Event_Time_Mth,
Event_Time_Date,
Event_Time_Day,
Event_Time_Hr,
Event_Time_HrGroup,
Event_Time_Min,
Event_Time,
Departure_Time,
Dwell_Time,
Dwell_Time2,
Delta_Time,
Latitude,
Longitude,
Heading,
Odometer_Distance,
Odometer_Distance_Lag1,
Odometer_Distance_Mi,
TravelDistance_Ft,
TravelDistance_Mi,
TravelDistance_Mi_Hvrs,
TD_Mi_q2,
TD_Mi_q98,
TD_Mi_SS_q5,
TD_Mi_SS_q95,
TD_Mi_SSHG_q5,
TD_Mi_SSHG_q95,
TD_Mi_Mean,
TD_Mi_Mean_F,
TD_Mi_SS_Mean,
TD_Mi_SS_Mean_F,
TD_Mi_SSHG_Mean,
TD_Mi_SSHG_Mean_F,
TD_Mi_Med,
TD_Mi_Med_F,
TD_Mi_SS_Med,
TD_Mi_SS_Med_F,
TD_Mi_SSHG_Med,
TD_Mi_SSHG_Med_F,
TD_Mi_Cnt,
TD_Mi_Cnt_F,
TD_Mi_SS_Cnt,
TD_Mi_SS_Cnt_F,
TD_Mi_SSHG_Cnt,
TD_Mi_SSHG_Cnt_F,
TravelTime_Sec,
TT_Sec_q2,
TT_Sec_q98,
TT_Sec_SS_q5,
TT_Sec_SS_q95,
TT_Sec_SSHG_q5,
TT_Sec_SSHG_q95,
TT_Sec_Mean,
TT_Sec_Mean_F,
TT_Sec_SS_Mean,
TT_Sec_SS_Mean_F,
TT_Sec_SSHG_Mean,
TT_Sec_SSHG_Mean_F,
TT_Sec_Med,
TT_Sec_Med_F,
TT_Sec_SS_Med,
TT_Sec_SS_Med_F,
TT_Sec_SSHG_Med,
TT_Sec_SSHG_Med_F,
TT_Sec_Cnt,
TT_Sec_Cnt_F,
TT_Sec_SS_Cnt,
TT_Sec_SS_Cnt_F,
TT_Sec_SSHG_Cnt,
TT_Sec_SSHG_Cnt_F,
TravelTime_Hr,
TT_Hr_q2,
TT_Hr_q98,
TT_Hr_SS_q5,
TT_Hr_SS_q95,
TT_Hr_SSHG_q5,
TT_Hr_SSHG_q95,
TT_Hr_Mean,
TT_Hr_Mean_F,
TT_Hr_SS_Mean,
TT_Hr_SS_Mean_F,
TT_Hr_SSHG_Mean,
TT_Hr_SSHG_Mean_F,
TT_Hr_Med,
TT_Hr_Med_F,
TT_Hr_SS_Med,
TT_Hr_SS_Med_F,
TT_Hr_SSHG_Med,
TT_Hr_SSHG_Med_F,
TT_Hr_Cnt,
TT_Hr_Cnt_F,
TT_Hr_SS_Cnt,
TT_Hr_SS_Cnt_F,
TT_Hr_SSHG_Cnt,
TT_Hr_SSHG_Cnt_F,
SpeedAvg_Mph
)
rm(AllDays_DirChange)
str(AllDays_NewOrder)
'data.frame': 2809529 obs. of 114 variables:
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RteChange2 : Factor w/ 2 levels "Change","Same": 1 2 2 2 2 2 2 2 2 2 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ DirChange2 : Factor w/ 2 levels "Change","Same": 1 2 2 2 2 2 1 2 2 2 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Odometer_Distance_Lag1: int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs: num NA 0.15 0.105 0.165 0.832 ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TD_Mi_SS_q5 : num NA 0.0252 0.2422 0.7324 0.0794 ...
$ TD_Mi_SS_q95 : num NA 0.626 0.242 1.008 0.176 ...
$ TD_Mi_SSHG_q5 : num NA 0.0996 0.2422 0.7002 0.1816 ...
$ TD_Mi_SSHG_q95 : num NA 0.627 0.242 0.7 0.182 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_SS_Mean : num NaN 0.437 0.242 0.908 0.128 ...
$ TD_Mi_SS_Mean_F : num NaN 0.457 0.242 0.977 NaN ...
$ TD_Mi_SSHG_Mean : num NaN 0.442 0.242 0.7 0.182 ...
$ TD_Mi_SSHG_Mean_F : num NaN 0.491 0.242 0.7 0.182 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_SS_Med : num NA 0.512 0.242 0.962 0.128 ...
$ TD_Mi_SS_Med_F : num NA 0.512 0.242 1.008 NA ...
$ TD_Mi_SSHG_Med : num NA 0.512 0.242 0.7 0.182 ...
$ TD_Mi_SSHG_Med_F : num NA 0.512 0.242 0.7 0.182 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TD_Mi_SS_Cnt : int 0 14 1 4 2 87 22 118 91 11 ...
$ TD_Mi_SS_Cnt_F : int 0 12 1 3 0 77 18 106 81 9 ...
$ TD_Mi_SSHG_Cnt : int 0 7 1 1 1 23 6 29 28 3 ...
$ TD_Mi_SSHG_Cnt_F : int 0 5 1 1 1 19 4 25 24 1 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Sec_SS_q5 : num NA 11.9 37 30.5 172.9 ...
$ TT_Sec_SS_q95 : num NA 346.3 37 75.8 189.1 ...
$ TT_Sec_SSHG_q5 : num NA 59.6 37 25 190 11.6 236 51.5 55 8.8 ...
$ TT_Sec_SSHG_q95 : num NA 276 37 25 190 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_SS_Mean : num NaN 215.8 37 58.2 181 ...
$ TT_Sec_SS_Mean_F : num NaN 218.9 37 65.5 NaN ...
$ TT_Sec_SSHG_Mean : num NaN 202 37 25 190 ...
$ TT_Sec_SSHG_Mean_F : num NaN 226 37 25 190 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_SS_Med : num NA 223.5 37 65.5 181 ...
$ TT_Sec_SS_Med_F : num NA 223.5 37 65.5 NA ...
$ TT_Sec_SSHG_Med : num NA 219 37 25 190 134 286 60 65 16 ...
$ TT_Sec_SSHG_Med_F : num NA 219 37 25 190 134 286 60 65 16 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Sec_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Sec_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TT_Sec_SSHG_Cnt : int 0 7 1 1 1 35 6 36 35 3 ...
$ TT_Sec_SSHG_Cnt_F : int 0 5 1 1 1 31 4 32 32 1 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TT_Hr_SS_q5 : num NA 0.00331 0.01028 0.00849 0.04803 ...
$ TT_Hr_SS_q95 : num NA 0.0962 0.0103 0.0211 0.0525 ...
$ TT_Hr_SSHG_q5 : num NA 0.01656 0.01028 0.00694 0.05278 ...
$ TT_Hr_SSHG_q95 : num NA 0.07653 0.01028 0.00694 0.05278 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_SS_Mean : num NaN 0.0599 0.0103 0.0162 0.0503 ...
$ TT_Hr_SS_Mean_F : num NaN 0.0608 0.0103 0.0182 NaN ...
[list output truncated]
# View(head(AllDays_NewOrder, 500))
# View(tail(AllDays_NewOrder, 500))
Summarizing the data to help spot anomolies.
summary(AllDays_NewOrder)
RowNum_OG group StartStop_ID BusDay_EventNum Bus_ID
Min. : 1 1:559521 Length:2809529 Min. : 1.0 Min. : 11
1st Qu.: 784722 2:561389 Class :character 1st Qu.: 113.0 1st Qu.:2922
Median :1563300 3:567794 Mode :character Median : 248.0 Median :6195
Mean :1562504 4:559180 Mean : 290.5 Mean :5382
3rd Qu.:2337981 5:561645 3rd Qu.: 428.0 3rd Qu.:7104
Max. :3119443 Max. :1344.0 Max. :8105
Route RteChange2 RouteAlt DirChange2
Length:2809529 Change: 23772 2 :1128810 Change: 65126
Class :character Same :2785757 1 :1065425 Same :2744403
Mode :character 3 : 260372
4 : 130801
5 : 75039
6 : 56408
(Other): 92674
Route_Direction Stop_Sequence Start_ID Start_Desc
SOUTH :739235 Min. : 1.00 Length:2809529 Length:2809529
NORTH :735203 1st Qu.: 12.00 Class :character Class :character
WEST :649706 Median : 24.00 Mode :character Mode :character
EAST :628074 Mean : 26.83
LOOP : 35611 3rd Qu.: 39.00
CLOCKWIS: 10671 Max. :104.00
(Other) : 11029
StopID_Clean StopID_Indicator Stop_Desc Event_Type
Length:2809529 ID_Bad: 18948 Length:2809529 Min. :3.0
Class :character ID_OK :2790581 Class :character 1st Qu.:3.0
Mode :character Mode :character Median :4.0
Mean :3.6
3rd Qu.:4.0
Max. :5.0
Event_Description Event_Time_Yr
Serviced Stop :1127366 Min. :2016
Unknown Stop : 2579 1st Qu.:2016
UnServiced Stop :1679584 Median :2016
Mean :2016
3rd Qu.:2016
Max. :2016
Event_Time_Mth Event_Time_Date Event_Time_Day Event_Time_Hr Event_Time_HrGroup
Min. :10 Min. :3.000 Sun : 0 Min. : 0.00 Group6_8 :611612
1st Qu.:10 1st Qu.:4.000 Mon :559521 1st Qu.: 8.00 Group15_17:560103
Median :10 Median :5.000 Tues :561389 Median :13.00 Group18_20:461056
Mean :10 Mean :5.001 Wed :567794 Mean :12.97 Group9_11 :396514
3rd Qu.:10 3rd Qu.:6.000 Thurs:559180 3rd Qu.:18.00 Group12_14:353603
Max. :10 Max. :7.000 Fri :561645 Max. :23.00 Group21_23:244522
Sat : 0 (Other) :182119
Event_Time_Min Event_Time Departure_Time
Min. : 0.00 Min. :2016-10-03 00:00:00 Min. :2016-10-03 00:00:00
1st Qu.:14.00 1st Qu.:2016-10-04 08:36:14 1st Qu.:2016-10-04 08:36:20
Median :29.00 Median :2016-10-05 13:49:29 Median :2016-10-05 13:49:38
Mean :29.43 Mean :2016-10-05 13:29:21 Mean :2016-10-05 13:29:28
3rd Qu.:44.00 3rd Qu.:2016-10-06 17:58:06 3rd Qu.:2016-10-06 17:58:13
Max. :59.00 Max. :2016-10-07 23:59:59 Max. :2016-10-08 00:12:31
Dwell_Time Dwell_Time2 Delta_Time Latitude
Min. : 0.00 Min. : 0.000 Min. :-5606.0 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 14.0 1st Qu.:38.86
Median : 0.00 Median : 0.000 Median : 157.0 Median :38.90
Mean : 12.56 Mean : 6.359 Mean : 268.8 Mean :38.91
3rd Qu.: 5.00 3rd Qu.: 4.000 3rd Qu.: 396.0 3rd Qu.:38.96
Max. :6205.00 Max. :6205.000 Max. : 9426.0 Max. :39.19
Longitude Heading Odometer_Distance Odometer_Distance_Lag1
Min. :-77.45 Min. : 0.0 Min. : 0 Min. : 0
1st Qu.:-77.07 1st Qu.: 89.0 1st Qu.: 177595 1st Qu.: 177326
Median :-77.02 Median :180.0 Median : 377510 Median : 376934
Mean :-77.02 Mean :176.9 Mean : 426254 Mean : 425713
3rd Qu.:-76.97 3rd Qu.:269.0 3rd Qu.: 623667 3rd Qu.: 622879
Max. : 0.00 Max. :360.0 Max. :11108034 Max. :10853226
NA's :6528
Odometer_Distance_Mi TravelDistance_Ft TravelDistance_Mi TravelDistance_Mi_Hvrs
Min. : 0.00 Min. : 1 Min. : 0.0 Min. : 0.000
1st Qu.: 33.64 1st Qu.: 699 1st Qu.: 0.1 1st Qu.: 0.106
Median : 71.50 Median : 1044 Median : 0.2 Median : 0.142
Mean : 80.73 Mean : 1624 Mean : 0.3 Mean : 0.201
3rd Qu.: 118.12 3rd Qu.: 1518 3rd Qu.: 0.3 3rd Qu.: 0.193
Max. :2103.79 Max. :1323464 Max. :250.7 Max. :24.407
NA's :322734 NA's :322734 NA's :6528
TD_Mi_q2 TD_Mi_q98 TD_Mi_SS_q5 TD_Mi_SS_q95
Min. :0.05208 Min. :0.9585 Min. : 0.000 Min. : 0.000
1st Qu.:0.05208 1st Qu.:0.9585 1st Qu.: 0.086 1st Qu.: 0.262
Median :0.05208 Median :0.9585 Median : 0.104 Median : 0.326
Mean :0.05208 Mean :0.9585 Mean : 0.164 Mean : 0.488
3rd Qu.:0.05208 3rd Qu.:0.9585 3rd Qu.: 0.139 3rd Qu.: 0.436
Max. :0.05208 Max. :0.9585 Max. :219.163 Max. :246.949
NA's :24757 NA's :24757
TD_Mi_SSHG_q5 TD_Mi_SSHG_q95 TD_Mi_Mean TD_Mi_Mean_F
Min. : 0.00 Min. : 0.00 Min. :0.3076 Min. :0.2318
1st Qu.: 0.09 1st Qu.: 0.25 1st Qu.:0.3076 1st Qu.:0.2318
Median : 0.11 Median : 0.31 Median :0.3076 Median :0.2318
Mean : 0.18 Mean : 0.47 Mean :0.3076 Mean :0.2318
3rd Qu.: 0.15 3rd Qu.: 0.42 3rd Qu.:0.3076 3rd Qu.:0.2318
Max. :250.66 Max. :250.66 Max. :0.3076 Max. :0.2318
NA's :35629 NA's :35629
TD_Mi_SS_Mean TD_Mi_SS_Mean_F TD_Mi_SSHG_Mean TD_Mi_SSHG_Mean_F
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00
1st Qu.: 0.172 1st Qu.: 0.166 1st Qu.: 0.17 1st Qu.: 0.16
Median : 0.212 Median : 0.207 Median : 0.21 Median : 0.21
Mean : 0.307 Mean : 0.291 Mean : 0.31 Mean : 0.29
3rd Qu.: 0.267 3rd Qu.: 0.260 3rd Qu.: 0.27 3rd Qu.: 0.26
Max. :219.163 Max. :219.163 Max. :250.66 Max. :250.66
NA's :24757 NA's :27919 NA's :35629 NA's :44458
TD_Mi_Med TD_Mi_Med_F TD_Mi_SS_Med TD_Mi_SS_Med_F
Min. :0.1977 Min. :0.1977 Min. : 0.000 Min. : 0.000
1st Qu.:0.1977 1st Qu.:0.1977 1st Qu.: 0.146 1st Qu.: 0.146
Median :0.1977 Median :0.1977 Median : 0.196 Median : 0.196
Mean :0.1977 Mean :0.1977 Mean : 0.288 Mean : 0.282
3rd Qu.:0.1977 3rd Qu.:0.1977 3rd Qu.: 0.265 3rd Qu.: 0.265
Max. :0.1977 Max. :0.1977 Max. :219.163 Max. :219.163
NA's :24757 NA's :27919
TD_Mi_SSHG_Med TD_Mi_SSHG_Med_F TD_Mi_Cnt TD_Mi_Cnt_F
Min. : 0.00 Min. : 0.00 Min. :2486795 Min. :2387406
1st Qu.: 0.14 1st Qu.: 0.14 1st Qu.:2486795 1st Qu.:2387406
Median : 0.20 Median : 0.20 Median :2486795 Median :2387406
Mean : 0.29 Mean : 0.28 Mean :2486795 Mean :2387406
3rd Qu.: 0.27 3rd Qu.: 0.27 3rd Qu.:2486795 3rd Qu.:2387406
Max. :250.66 Max. :250.66 Max. :2486795 Max. :2387406
NA's :35629 NA's :44458
TD_Mi_SS_Cnt TD_Mi_SS_Cnt_F TD_Mi_SSHG_Cnt TD_Mi_SSHG_Cnt_F
Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 163.0 1st Qu.: 146.0 1st Qu.: 26.00 1st Qu.: 22.00
Median : 280.0 Median : 252.0 Median : 45.00 Median : 39.00
Mean : 347.4 Mean : 312.7 Mean : 57.27 Mean : 50.85
3rd Qu.: 456.0 3rd Qu.: 411.0 3rd Qu.: 75.00 3rd Qu.: 67.00
Max. :1543.0 Max. :1388.0 Max. :663.00 Max. :595.00
TravelTime_Sec TT_Sec_q2 TT_Sec_q98 TT_Sec_SS_q5 TT_Sec_SS_q95
Min. : 1.0 Min. :10 Min. :349 Min. : 1.00 Min. : 1.00
1st Qu.: 25.0 1st Qu.:10 1st Qu.:349 1st Qu.: 15.00 1st Qu.: 48.00
Median : 39.0 Median :10 Median :349 Median : 22.00 Median : 80.05
Mean : 104.9 Mean :10 Mean :349 Mean : 61.26 Mean : 183.28
3rd Qu.: 72.0 3rd Qu.:10 3rd Qu.:349 3rd Qu.: 34.00 3rd Qu.: 134.60
Max. :60750.0 Max. :10 Max. :349 Max. :60750.00 Max. :60750.00
NA's :6641 NA's :6531 NA's :6531
TT_Sec_SSHG_q5 TT_Sec_SSHG_q95 TT_Sec_Mean TT_Sec_Mean_F
Min. : 1.00 Min. : 1.00 Min. :104.9 Min. :56.61
1st Qu.: 16.00 1st Qu.: 43.80 1st Qu.:104.9 1st Qu.:56.61
Median : 23.40 Median : 72.95 Median :104.9 Median :56.61
Mean : 67.33 Mean : 169.21 Mean :104.9 Mean :56.61
3rd Qu.: 36.70 3rd Qu.: 123.65 3rd Qu.:104.9 3rd Qu.:56.61
Max. :60750.00 Max. :60750.00 Max. :104.9 Max. :56.61
NA's :6535 NA's :6535
TT_Sec_SS_Mean TT_Sec_SS_Mean_F TT_Sec_SSHG_Mean TT_Sec_SSHG_Mean_F
Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
1st Qu.: 29.06 1st Qu.: 27.54 1st Qu.: 28.38 1st Qu.: 27.21
Median : 44.16 Median : 41.91 Median : 43.38 Median : 41.48
Mean : 104.88 Mean : 91.34 Mean : 104.88 Mean : 93.53
3rd Qu.: 73.30 3rd Qu.: 69.25 3rd Qu.: 72.93 3rd Qu.: 70.12
Max. :60750.00 Max. :60750.00 Max. :60750.00 Max. :60750.00
NA's :6531 NA's :10519 NA's :6535 NA's :12811
TT_Sec_Med TT_Sec_Med_F TT_Sec_SS_Med TT_Sec_SS_Med_F
Min. :39 Min. :39 Min. : 1.00 Min. : 1.00
1st Qu.:39 1st Qu.:39 1st Qu.: 26.00 1st Qu.: 26.00
Median :39 Median :39 Median : 39.00 Median : 39.00
Mean :39 Mean :39 Mean : 91.55 Mean : 84.82
3rd Qu.:39 3rd Qu.:39 3rd Qu.: 65.00 3rd Qu.: 65.00
Max. :39 Max. :39 Max. :60750.00 Max. :60750.00
NA's :6531 NA's :10519
TT_Sec_SSHG_Med TT_Sec_SSHG_Med_F TT_Sec_Cnt TT_Sec_Cnt_F
Min. : 1.00 Min. : 1.00 Min. :2802888 Min. :2705189
1st Qu.: 26.00 1st Qu.: 26.00 1st Qu.:2802888 1st Qu.:2705189
Median : 39.00 Median : 38.50 Median :2802888 Median :2705189
Mean : 94.94 Mean : 88.44 Mean :2802888 Mean :2705189
3rd Qu.: 67.00 3rd Qu.: 66.50 3rd Qu.:2802888 3rd Qu.:2705189
Max. :60750.00 Max. :60750.00 Max. :2802888 Max. :2705189
NA's :6535 NA's :12811
TT_Sec_SS_Cnt TT_Sec_SS_Cnt_F TT_Sec_SSHG_Cnt TT_Sec_SSHG_Cnt_F
Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 194.0 1st Qu.: 177.0 1st Qu.: 29.00 1st Qu.: 26.00
Median : 310.0 Median : 282.0 Median : 51.00 Median : 46.00
Mean : 384.4 Mean : 349.8 Mean : 63.46 Mean : 57.09
3rd Qu.: 497.0 3rd Qu.: 452.0 3rd Qu.: 83.00 3rd Qu.: 74.00
Max. :1664.0 Max. :1523.0 Max. :691.00 Max. :634.00
TravelTime_Hr TT_Hr_q2 TT_Hr_q98 TT_Hr_SS_q5
Min. : 0.000 Min. :0.002778 Min. :0.09694 Min. : 0.000
1st Qu.: 0.007 1st Qu.:0.002778 1st Qu.:0.09694 1st Qu.: 0.004
Median : 0.011 Median :0.002778 Median :0.09694 Median : 0.006
Mean : 0.029 Mean :0.002778 Mean :0.09694 Mean : 0.017
3rd Qu.: 0.020 3rd Qu.:0.002778 3rd Qu.:0.09694 3rd Qu.: 0.009
Max. :16.875 Max. :0.002778 Max. :0.09694 Max. :16.875
NA's :6641 NA's :6531
TT_Hr_SS_q95 TT_Hr_SSHG_q5 TT_Hr_SSHG_q95 TT_Hr_Mean
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.02913
1st Qu.: 0.013 1st Qu.: 0.004 1st Qu.: 0.012 1st Qu.:0.02913
Median : 0.022 Median : 0.006 Median : 0.020 Median :0.02913
Mean : 0.051 Mean : 0.019 Mean : 0.047 Mean :0.02913
3rd Qu.: 0.037 3rd Qu.: 0.010 3rd Qu.: 0.034 3rd Qu.:0.02913
Max. :16.875 Max. :16.875 Max. :16.875 Max. :0.02913
NA's :6531 NA's :6535 NA's :6535
TT_Hr_Mean_F TT_Hr_SS_Mean TT_Hr_SS_Mean_F TT_Hr_SSHG_Mean
Min. :0.01573 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:0.01573 1st Qu.: 0.008 1st Qu.: 0.008 1st Qu.: 0.008
Median :0.01573 Median : 0.012 Median : 0.012 Median : 0.012
Mean :0.01573 Mean : 0.029 Mean : 0.025 Mean : 0.029
3rd Qu.:0.01573 3rd Qu.: 0.020 3rd Qu.: 0.019 3rd Qu.: 0.020
Max. :0.01573 Max. :16.875 Max. :16.875 Max. :16.875
NA's :6531 NA's :10532 NA's :6535
TT_Hr_SSHG_Mean_F TT_Hr_Med TT_Hr_Med_F TT_Hr_SS_Med
Min. : 0.000 Min. :0.01083 Min. :0.01083 Min. : 0.000
1st Qu.: 0.008 1st Qu.:0.01083 1st Qu.:0.01083 1st Qu.: 0.007
Median : 0.012 Median :0.01083 Median :0.01083 Median : 0.011
Mean : 0.026 Mean :0.01083 Mean :0.01083 Mean : 0.025
3rd Qu.: 0.019 3rd Qu.:0.01083 3rd Qu.:0.01083 3rd Qu.: 0.018
Max. :16.875 Max. :0.01083 Max. :0.01083 Max. :16.875
NA's :12895 NA's :6531
TT_Hr_SS_Med_F TT_Hr_SSHG_Med TT_Hr_SSHG_Med_F TT_Hr_Cnt
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :2802888
1st Qu.: 0.007 1st Qu.: 0.007 1st Qu.: 0.007 1st Qu.:2802888
Median : 0.011 Median : 0.011 Median : 0.011 Median :2802888
Mean : 0.024 Mean : 0.026 Mean : 0.025 Mean :2802888
3rd Qu.: 0.018 3rd Qu.: 0.019 3rd Qu.: 0.018 3rd Qu.:2802888
Max. :16.875 Max. :16.875 Max. :16.875 Max. :2802888
NA's :10532 NA's :6535 NA's :12895
TT_Hr_Cnt_F TT_Hr_SS_Cnt TT_Hr_SS_Cnt_F TT_Hr_SSHG_Cnt
Min. :2705189 Min. : 0.0 Min. : 0.0 Min. : 0.00
1st Qu.:2705189 1st Qu.: 194.0 1st Qu.: 176.0 1st Qu.: 29.00
Median :2705189 Median : 310.0 Median : 282.0 Median : 51.00
Mean :2705189 Mean : 384.4 Mean : 349.6 Mean : 63.46
3rd Qu.:2705189 3rd Qu.: 497.0 3rd Qu.: 452.0 3rd Qu.: 83.00
Max. :2705189 Max. :1664.0 Max. :1523.0 Max. :691.00
TT_Hr_SSHG_Cnt_F SpeedAvg_Mph
Min. : 0.00 Min. : 0.0
1st Qu.: 26.00 1st Qu.: 10.1
Median : 46.00 Median : 16.7
Mean : 57.05 Mean : 26.5
3rd Qu.: 74.00 3rd Qu.: 31.2
Max. :634.00 Max. :22924.1
NA's :322762
Investigation of TravelDistance_Mi.
View(TravDistMi_Pctiles): 99% of TravelDistance_Mi are about 1 mile or less…but some weird TravelDistance_Mi values (e.g., 584 miles traveled) exist.
TravDistMi_Ntile <- as.data.frame(AllDays_NewOrder$TravelDistance_Mi) %>%
mutate(#Pctile = ntile(AllDays_NewOrder$TravelDistance_Mi, 100),
#MinR = min_rank(AllDays_NewOrder$TravelDistance_Mi),
PctR = percent_rank(AllDays_NewOrder$TravelDistance_Mi),
PctR_Round = round(PctR, 2)
)
colnames(TravDistMi_Ntile)[1] <- "TravelDistance_Mi"
# str(TravDistMi_Ntile)
TravDistMi_Ntile_Rows <- nrow(TravDistMi_Ntile)
# View(tail(TravDistMi_Ntile, 500))
TravDistMi_Pctiles <- group_by(TravDistMi_Ntile,
PctR_Round
) %>%
summarise(
MinTravDistMiAtPctile = min(TravelDistance_Mi),
CntsAtPctile = n(),
PctsAtPctile = CntsAtPctile / TravDistMi_Ntile_Rows
) %>%
mutate(CumSumPAtP = cumsum(PctsAtPctile)
)
rm(TravDistMi_Ntile)
rm(TravDistMi_Ntile_Rows)
View(TravDistMi_Pctiles)
TravDistMi_Pctiles
Investigation of TravelDistance_Mi.
Why are some TravelDistance_Mi “NA”? It looks like partially because the records are the first trip of the day (for that bus), so I purposefully set the distance to “NA”. Another reason is due to the odometer recording a value less than the previous odometer recording. In most cases, I have no explanation for this - though I have observed about 67% of all instances where TravelDistance_Mi is NA (other than because it’s the first record of the day) are instances where DirChange2 is “Change”. This is weird and should be asked to WMATA.
# View(head(AllDays_NewOrder, 500))
View(filter(AllDays_NewOrder,
BusDay_EventNum != 1 # When BusDay_EventNum == 1, TravelDistance_Mi is NA by design (don't want to calculate distance based on yesterday's position)
) %>%
group_by(StartStop_ID) %>%
summarise(Cnts = sum(is.na(TravelDistance_Mi)
)
) %>%
arrange(desc(Cnts)
)
)
View(filter(AllDays_NewOrder,
StartStop_ID == "1000245--1000211"
) %>%
select(RowNum_OG,
StartStop_ID,
Event_Time,
Event_Time_HrGroup,
Bus_ID,
TravelDistance_Mi,
TravelDistance_Mi_Hvrs,
TD_Mi_SS_Mean,
TD_Mi_SS_Mean_F,
TD_Mi_SSHG_Mean,
TD_Mi_SSHG_Mean_F,
TD_Mi_SS_Med,
TD_Mi_SS_Med_F,
TD_Mi_SSHG_Med,
TD_Mi_SSHG_Med_F,
TD_Mi_SS_Cnt,
TD_Mi_SS_Cnt_F,
TD_Mi_SSHG_Cnt,
TD_Mi_SSHG_Cnt_F
) %>%
mutate(Ratio_MeanToHvrs = TD_Mi_SS_Mean / TravelDistance_Mi_Hvrs) %>%
arrange(Event_Time)
)
View(filter(AllDays_NewOrder,
is.na(TravelDistance_Mi)
)
)
# These records are NA becuase the record is the first record of the day (the Event_Time_Date)
View(filter(AllDays_NewOrder,
between(RowNum_OG, 326, 346) | # 336
between(RowNum_OG, 591, 611) | # 601
between(RowNum_OG, 845, 865) # 855
)
)
Investigation of TravelDistance_Mi.
These records are NA becuase the current record odometer is less than the previous record odometer. Theoretically, this should NOT happen. Me: it appears that about 67% of all instances where TravelDistance_Mi is NA (other than because it’s th first record of the day) are instances where DirChange2 is “Change”. This is weird and should be asked to WMATA.
View(filter(AllDays_NewOrder,
between(RowNum_OG, 194, 214) | # 204
between(RowNum_OG, 440, 460) | # 450
between(RowNum_OG, 478, 498) | # 488
between(RowNum_OG, 510, 530) # 520
)
)
TestTable <- filter(AllDays_NewOrder,
BusDay_EventNum != 1
) %>%
mutate(TravelDistance_NA = as.factor(ifelse(is.na(TravelDistance_Mi),
"True",
"False"
)
)
) %>%
group_by(DirChange2, TravelDistance_NA) %>%
summarise(TravDistMi_NACnts = n()
)
# TestTable
TestTable_Spread <- as.data.frame(spread(TestTable,
TravelDistance_NA,
TravDistMi_NACnts
)
) %>%
select(False,
True
)
row.names(TestTable_Spread) <- c("Change", "Same")
# str(TestTable_Spread)
# TestTable_Spread
prop.table(as.table(as.matrix(TestTable_Spread)
),
1
)
False True
Change 0.8267006 0.1732994
Same 0.8884818 0.1115182
prop.table(as.table(as.matrix(TestTable_Spread)
),
2
)
False True
Change 0.01948009 0.03211514
Same 0.98051991 0.96788486
Investigation of TravelDistance_Mi.
Let’s look at just the TravelDistance_Mi values that are NOT “NA”.
rm(TestTable)
rm(TestTable_Spread)
TravelDistance_Mi_NoNA <- filter(AllDays_NewOrder,
# TravelDistance_Mi != 0 &
!is.na(TravelDistance_Mi)
)
dim(AllDays_NewOrder)
[1] 2809529 114
dim(TravelDistance_Mi_NoNA)
[1] 2486795 114
nrow(AllDays_NewOrder) - nrow(TravelDistance_Mi_NoNA)
[1] 322734
str(TravelDistance_Mi_NoNA)
'data.frame': 2486795 obs. of 114 variables:
$ RowNum_OG : int 3 4 5 6 7 9 10 11 12 13 ...
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ StartStop_ID : chr "5004572--5004573" "5004573--5002210" "5002210--5002209" "5002209--5000070" ...
$ BusDay_EventNum : int 2 3 4 5 6 7 8 9 10 11 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RteChange2 : Factor w/ 2 levels "Change","Same": 2 2 2 2 2 2 2 2 2 2 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 6 6 6 6 6 ...
$ DirChange2 : Factor w/ 2 levels "Change","Same": 2 2 2 2 2 1 2 2 2 2 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 6 3 2 8 1 2 3 4 2 6 ...
$ Start_ID : chr "5004572" "5004573" "5002210" "5002209" ...
$ Start_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ StopID_Clean : chr "5004573" "5002210" "5002209" "5000070" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Stop_Desc : chr "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" "FRANCONIA-SPRGFLD STA. + BUS BAY D" ...
$ Event_Type : int 4 4 4 3 3 4 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 1 1 3 3 3 3 3 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Min : int 9 10 10 13 14 21 21 23 23 26 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:09:47" "2016-10-03 06:10:24" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:09:47" "2016-10-03 06:10:24" ...
$ Dwell_Time : int 0 0 0 0 104 0 0 0 0 0 ...
$ Dwell_Time2 : num 0 0 0 0 104 0 0 0 0 0 ...
$ Delta_Time : int 24 165 25 73 719 74 76 63 69 165 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 97 276 15 119 100 274 104 241 274 1 ...
$ Odometer_Distance : int 45139 46418 50115 51074 51303 55633 56163 56285 57262 58363 ...
$ Odometer_Distance_Lag1: int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Odometer_Distance_Mi : num 8.55 8.79 9.49 9.67 9.72 ...
$ TravelDistance_Ft : int 1596 1279 3697 959 229 4330 530 122 977 1101 ...
$ TravelDistance_Mi : num 0.3023 0.2422 0.7002 0.1816 0.0434 ...
$ TravelDistance_Mi_Hvrs: num 0.15 0.105 0.165 0.832 0.068 ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TD_Mi_SS_q5 : num 0.025246 0.242235 0.732434 0.079432 0.000436 ...
$ TD_Mi_SS_q95 : num 0.626 0.242 1.008 0.176 10.435 ...
$ TD_Mi_SSHG_q5 : num 0.09956 0.24223 0.70019 0.18163 0.00269 ...
$ TD_Mi_SSHG_q95 : num 0.627 0.242 0.7 0.182 0.497 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_SS_Mean : num 0.437 0.242 0.908 0.128 1.166 ...
$ TD_Mi_SS_Mean_F : num 0.457 0.242 0.977 NaN 0.226 ...
$ TD_Mi_SSHG_Mean : num 0.442 0.242 0.7 0.182 0.232 ...
$ TD_Mi_SSHG_Mean_F : num 0.491 0.242 0.7 0.182 0.228 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_SS_Med : num 0.5116 0.2422 0.9616 0.1278 0.0426 ...
$ TD_Mi_SS_Med_F : num 0.5116 0.2422 1.0081 NA 0.0426 ...
$ TD_Mi_SSHG_Med : num 0.512 0.242 0.7 0.182 0.108 ...
$ TD_Mi_SSHG_Med_F : num 0.512 0.242 0.7 0.182 0.108 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TD_Mi_SS_Cnt : int 14 1 4 2 87 22 118 91 11 2 ...
$ TD_Mi_SS_Cnt_F : int 12 1 3 0 77 18 106 81 9 0 ...
$ TD_Mi_SSHG_Cnt : int 7 1 1 1 23 6 29 28 3 1 ...
$ TD_Mi_SSHG_Cnt_F : int 5 1 1 1 19 4 25 24 1 1 ...
$ TravelTime_Sec : num 180 37 25 190 29 288 52 76 8 189 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Sec_SS_q5 : num 11.9 37 30.5 172.9 10 ...
$ TT_Sec_SS_q95 : num 346.3 37 75.8 189.1 1737.2 ...
$ TT_Sec_SSHG_q5 : num 59.6 37 25 190 11.6 236 51.5 55 8.8 189 ...
$ TT_Sec_SSHG_q95 : num 276 37 25 190 675 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_SS_Mean : num 215.8 37 58.2 181 585.3 ...
$ TT_Sec_SS_Mean_F : num 218.9 37 65.5 NaN 249.3 ...
$ TT_Sec_SSHG_Mean : num 202 37 25 190 257 ...
$ TT_Sec_SSHG_Mean_F : num 226 37 25 190 244 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_SS_Med : num 223.5 37 65.5 181 33 ...
$ TT_Sec_SS_Med_F : num 223.5 37 65.5 NA 32 ...
$ TT_Sec_SSHG_Med : num 219 37 25 190 134 286 60 65 16 189 ...
$ TT_Sec_SSHG_Med_F : num 219 37 25 190 134 286 60 65 16 189 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Sec_SS_Cnt : int 14 1 4 2 173 22 141 141 11 2 ...
$ TT_Sec_SS_Cnt_F : int 12 1 2 0 156 18 127 128 9 0 ...
$ TT_Sec_SSHG_Cnt : int 7 1 1 1 35 6 36 35 3 1 ...
$ TT_Sec_SSHG_Cnt_F : int 5 1 1 1 31 4 32 32 1 1 ...
$ TravelTime_Hr : num 0.05 0.01028 0.00694 0.05278 0.00806 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TT_Hr_SS_q5 : num 0.00331 0.01028 0.00849 0.04803 0.00278 ...
$ TT_Hr_SS_q95 : num 0.0962 0.0103 0.0211 0.0525 0.4826 ...
$ TT_Hr_SSHG_q5 : num 0.01656 0.01028 0.00694 0.05278 0.00322 ...
$ TT_Hr_SSHG_q95 : num 0.07653 0.01028 0.00694 0.05278 0.18739 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_SS_Mean : num 0.0599 0.0103 0.0162 0.0503 0.1626 ...
$ TT_Hr_SS_Mean_F : num 0.0608 0.0103 0.0182 NaN 0.0692 ...
[list output truncated]
summary(TravelDistance_Mi_NoNA)
RowNum_OG group StartStop_ID BusDay_EventNum Bus_ID
Min. : 3 1:496190 Length:2486795 Min. : 2.0 Min. : 11
1st Qu.: 786568 2:497932 Class :character 1st Qu.: 115.0 1st Qu.:2923
Median :1590497 3:501611 Mode :character Median : 251.0 Median :6202
Mean :1578192 4:495069 Mean : 293.2 Mean :5431
3rd Qu.:2351264 5:495993 3rd Qu.: 431.0 3rd Qu.:7113
Max. :3119443 Max. :1344.0 Max. :8105
Route RteChange2 RouteAlt DirChange2
Length:2486795 Change: 13709 2 :994645 Change: 48443
Class :character Same :2473086 1 :943279 Same :2438352
Mode :character 3 :229032
4 :117090
5 : 67811
6 : 51391
(Other): 83547
Route_Direction Stop_Sequence Start_ID Start_Desc
SOUTH :667198 Min. : 1.00 Length:2486795 Length:2486795
NORTH :662471 1st Qu.: 12.00 Class :character Class :character
WEST :565616 Median : 24.00 Mode :character Mode :character
EAST :543386 Mean : 27.13
LOOP : 33484 3rd Qu.: 39.00
CLOCKWIS: 7012 Max. :104.00
(Other) : 7628
StopID_Clean StopID_Indicator Stop_Desc Event_Type
Length:2486795 ID_Bad: 14271 Length:2486795 Min. :3.000
Class :character ID_OK :2472524 Class :character 1st Qu.:3.000
Mode :character Mode :character Median :4.000
Mean :3.626
3rd Qu.:4.000
Max. :5.000
Event_Description Event_Time_Yr
Serviced Stop : 930934 Min. :2016
Unknown Stop : 1794 1st Qu.:2016
UnServiced Stop :1554067 Median :2016
Mean :2016
3rd Qu.:2016
Max. :2016
Event_Time_Mth Event_Time_Date Event_Time_Day Event_Time_Hr Event_Time_HrGroup
Min. :10 Min. :3.000 Sun : 0 Min. : 0.00 Group6_8 :538348
1st Qu.:10 1st Qu.:4.000 Mon :496190 1st Qu.: 8.00 Group15_17:497156
Median :10 Median :5.000 Tues :497932 Median :13.00 Group18_20:408957
Mean :10 Mean :4.999 Wed :501611 Mean :12.99 Group9_11 :351804
3rd Qu.:10 3rd Qu.:6.000 Thurs:495069 3rd Qu.:18.00 Group12_14:314050
Max. :10 Max. :7.000 Fri :495993 Max. :23.00 Group21_23:217259
Sat : 0 (Other) :159221
Event_Time_Min Event_Time Departure_Time
Min. : 0.00 Min. :2016-10-03 00:00:09 Min. :2016-10-03 00:00:09
1st Qu.:14.00 1st Qu.:2016-10-04 08:35:52 1st Qu.:2016-10-04 08:35:59
Median :29.00 Median :2016-10-05 13:46:00 Median :2016-10-05 13:46:06
Mean :29.43 Mean :2016-10-05 13:27:43 Mean :2016-10-05 13:27:49
3rd Qu.:44.00 3rd Qu.:2016-10-06 17:57:32 3rd Qu.:2016-10-06 17:57:39
Max. :59.00 Max. :2016-10-07 23:59:59 Max. :2016-10-08 00:12:31
Dwell_Time Dwell_Time2 Delta_Time Latitude
Min. : 0.00 Min. : 0.000 Min. :-5606.0 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 16.0 1st Qu.:38.86
Median : 0.00 Median : 0.000 Median : 160.0 Median :38.90
Mean : 11.86 Mean : 5.994 Mean : 274.1 Mean :38.91
3rd Qu.: 4.00 3rd Qu.: 4.000 3rd Qu.: 402.0 3rd Qu.:38.96
Max. :6205.00 Max. :6205.000 Max. : 9426.0 Max. :39.19
Longitude Heading Odometer_Distance Odometer_Distance_Lag1
Min. :-77.45 Min. : 0.0 Min. : 1 Min. : 0
1st Qu.:-77.07 1st Qu.: 89.0 1st Qu.: 200268 1st Qu.: 198635
Median :-77.01 Median :180.0 Median : 394700 Median : 393026
Mean :-77.02 Mean :176.7 Mean : 443225 Mean : 441601
3rd Qu.:-76.97 3rd Qu.:269.0 3rd Qu.: 633936 3rd Qu.: 632313
Max. : 0.00 Max. :360.0 Max. :11108034 Max. :10853226
Odometer_Distance_Mi TravelDistance_Ft TravelDistance_Mi TravelDistance_Mi_Hvrs
Min. : 0.0002 Min. : 1 Min. : 0.00019 Min. : 0.0000
1st Qu.: 37.9295 1st Qu.: 699 1st Qu.: 0.13239 1st Qu.: 0.1034
Median : 74.7538 Median : 1044 Median : 0.19773 Median : 0.1378
Mean : 83.9442 Mean : 1624 Mean : 0.30760 Mean : 0.1918
3rd Qu.: 120.0635 3rd Qu.: 1518 3rd Qu.: 0.28750 3rd Qu.: 0.1828
Max. :2103.7943 Max. :1323464 Max. :250.65606 Max. :24.1507
TD_Mi_q2 TD_Mi_q98 TD_Mi_SS_q5 TD_Mi_SS_q95
Min. :0.05208 Min. :0.9585 Min. : 0.00019 Min. : 0.00019
1st Qu.:0.05208 1st Qu.:0.9585 1st Qu.: 0.08848 1st Qu.: 0.25878
Median :0.05208 Median :0.9585 Median : 0.10608 Median : 0.32239
Mean :0.05208 Mean :0.9585 Mean : 0.16872 Mean : 0.47949
3rd Qu.:0.05208 3rd Qu.:0.9585 3rd Qu.: 0.13977 3rd Qu.: 0.42822
Max. :0.05208 Max. :0.9585 Max. :219.16288 Max. :246.94938
TD_Mi_SSHG_q5 TD_Mi_SSHG_q95 TD_Mi_Mean TD_Mi_Mean_F
Min. : 0.00019 Min. : 0.00019 Min. :0.3076 Min. :0.2318
1st Qu.: 0.09167 1st Qu.: 0.24754 1st Qu.:0.3076 1st Qu.:0.2318
Median : 0.11395 Median : 0.31174 Median :0.3076 Median :0.2318
Mean : 0.18528 Mean : 0.46625 Mean :0.3076 Mean :0.2318
3rd Qu.: 0.15093 3rd Qu.: 0.41899 3rd Qu.:0.3076 3rd Qu.:0.2318
Max. :250.65606 Max. :250.65606 Max. :0.3076 Max. :0.2318
TD_Mi_SS_Mean TD_Mi_SS_Mean_F TD_Mi_SSHG_Mean TD_Mi_SSHG_Mean_F
Min. : 0.00019 Min. : 0.0002 Min. : 0.00019 Min. : 0.000
1st Qu.: 0.17129 1st Qu.: 0.1663 1st Qu.: 0.16760 1st Qu.: 0.163
Median : 0.21082 Median : 0.2058 Median : 0.20965 Median : 0.206
Mean : 0.30760 Mean : 0.2916 Mean : 0.30760 Mean : 0.294
3rd Qu.: 0.26422 3rd Qu.: 0.2582 3rd Qu.: 0.26616 3rd Qu.: 0.262
Max. :219.16288 Max. :219.1629 Max. :250.65606 Max. :250.656
NA's :2678 NA's :4904
TD_Mi_Med TD_Mi_Med_F TD_Mi_SS_Med TD_Mi_SS_Med_F
Min. :0.1977 Min. :0.1977 Min. : 0.00019 Min. : 0.0002
1st Qu.:0.1977 1st Qu.:0.1977 1st Qu.: 0.14602 1st Qu.: 0.1458
Median :0.1977 Median :0.1977 Median : 0.19470 Median : 0.1947
Mean :0.1977 Mean :0.1977 Mean : 0.28931 Mean : 0.2827
3rd Qu.:0.1977 3rd Qu.:0.1977 3rd Qu.: 0.26326 3rd Qu.: 0.2633
Max. :0.1977 Max. :0.1977 Max. :219.16288 Max. :219.1629
NA's :2678
TD_Mi_SSHG_Med TD_Mi_SSHG_Med_F TD_Mi_Cnt TD_Mi_Cnt_F
Min. : 0.00019 Min. : 0.000 Min. :2486795 Min. :2387406
1st Qu.: 0.14403 1st Qu.: 0.144 1st Qu.:2486795 1st Qu.:2387406
Median : 0.19527 Median : 0.195 Median :2486795 Median :2387406
Mean : 0.29152 Mean : 0.285 Mean :2486795 Mean :2387406
3rd Qu.: 0.26657 3rd Qu.: 0.266 3rd Qu.:2486795 3rd Qu.:2387406
Max. :250.65606 Max. :250.656 Max. :2486795 Max. :2387406
NA's :4904
TD_Mi_SS_Cnt TD_Mi_SS_Cnt_F TD_Mi_SSHG_Cnt TD_Mi_SSHG_Cnt_F
Min. : 1.0 Min. : 0.0 Min. : 1.00 Min. : 0.00
1st Qu.: 178.0 1st Qu.: 160.0 1st Qu.: 28.00 1st Qu.: 24.00
Median : 295.0 Median : 266.0 Median : 48.00 Median : 42.00
Mean : 363.3 Mean : 327.1 Mean : 60.01 Mean : 53.31
3rd Qu.: 476.0 3rd Qu.: 428.0 3rd Qu.: 78.00 3rd Qu.: 70.00
Max. :1543.0 Max. :1388.0 Max. :663.00 Max. :595.00
TravelTime_Sec TT_Sec_q2 TT_Sec_q98 TT_Sec_SS_q5 TT_Sec_SS_q95
Min. : 1 Min. :10 Min. :349 Min. : 1.00 Min. : 1.00
1st Qu.: 24 1st Qu.:10 1st Qu.:349 1st Qu.: 15.00 1st Qu.: 47.00
Median : 38 Median :10 Median :349 Median : 21.00 Median : 77.75
Mean : 100 Mean :10 Mean :349 Mean : 57.38 Mean : 176.22
3rd Qu.: 70 3rd Qu.:10 3rd Qu.:349 3rd Qu.: 32.00 3rd Qu.: 129.65
Max. :54551 Max. :10 Max. :349 Max. :54551.00 Max. :54551.00
NA's :28
TT_Sec_SSHG_q5 TT_Sec_SSHG_q95 TT_Sec_Mean TT_Sec_Mean_F
Min. : 1.00 Min. : 1.00 Min. :104.9 Min. :56.61
1st Qu.: 15.20 1st Qu.: 42.70 1st Qu.:104.9 1st Qu.:56.61
Median : 22.50 Median : 70.55 Median :104.9 Median :56.61
Mean : 62.94 Mean : 161.25 Mean :104.9 Mean :56.61
3rd Qu.: 34.80 3rd Qu.: 119.60 3rd Qu.:104.9 3rd Qu.:56.61
Max. :54551.00 Max. :54551.00 Max. :104.9 Max. :56.61
TT_Sec_SS_Mean TT_Sec_SS_Mean_F TT_Sec_SSHG_Mean TT_Sec_SSHG_Mean_F
Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
1st Qu.: 28.20 1st Qu.: 26.62 1st Qu.: 27.51 1st Qu.: 26.34
Median : 42.61 Median : 40.46 Median : 41.76 Median : 39.96
Mean : 99.62 Mean : 86.96 Mean : 99.55 Mean : 88.81
3rd Qu.: 69.71 3rd Qu.: 66.44 3rd Qu.: 70.02 3rd Qu.: 67.22
Max. :54551.00 Max. :54551.00 Max. :54551.00 Max. :54551.00
NA's :2603 NA's :3772
TT_Sec_Med TT_Sec_Med_F TT_Sec_SS_Med TT_Sec_SS_Med_F
Min. :39 Min. :39 Min. : 1.00 Min. : 1.00
1st Qu.:39 1st Qu.:39 1st Qu.: 25.00 1st Qu.: 25.00
Median :39 Median :39 Median : 37.00 Median : 37.00
Mean :39 Mean :39 Mean : 86.88 Mean : 80.62
3rd Qu.:39 3rd Qu.:39 3rd Qu.: 62.00 3rd Qu.: 62.00
Max. :39 Max. :39 Max. :54551.00 Max. :54551.00
NA's :2603
TT_Sec_SSHG_Med TT_Sec_SSHG_Med_F TT_Sec_Cnt TT_Sec_Cnt_F
Min. : 1.00 Min. : 1.00 Min. :2802888 Min. :2705189
1st Qu.: 25.00 1st Qu.: 25.00 1st Qu.:2802888 1st Qu.:2705189
Median : 37.00 Median : 37.00 Median :2802888 Median :2705189
Mean : 90.07 Mean : 83.87 Mean :2802888 Mean :2705189
3rd Qu.: 64.00 3rd Qu.: 64.00 3rd Qu.:2802888 3rd Qu.:2705189
Max. :54551.00 Max. :54551.00 Max. :2802888 Max. :2705189
NA's :3772
TT_Sec_SS_Cnt TT_Sec_SS_Cnt_F TT_Sec_SSHG_Cnt TT_Sec_SSHG_Cnt_F
Min. : 1.0 Min. : 0.0 Min. : 1.0 Min. : 0.00
1st Qu.: 200.0 1st Qu.: 183.0 1st Qu.: 30.0 1st Qu.: 27.00
Median : 321.0 Median : 292.0 Median : 52.0 Median : 47.00
Mean : 392.4 Mean : 357.2 Mean : 64.7 Mean : 58.23
3rd Qu.: 509.0 3rd Qu.: 464.0 3rd Qu.: 84.0 3rd Qu.: 76.00
Max. :1664.0 Max. :1523.0 Max. :691.0 Max. :634.00
TravelTime_Hr TT_Hr_q2 TT_Hr_q98 TT_Hr_SS_q5
Min. : 0.000278 Min. :0.002778 Min. :0.09694 Min. : 0.000278
1st Qu.: 0.006667 1st Qu.:0.002778 1st Qu.:0.09694 1st Qu.: 0.004167
Median : 0.010556 Median :0.002778 Median :0.09694 Median : 0.005833
Mean : 0.027782 Mean :0.002778 Mean :0.09694 Mean : 0.015938
3rd Qu.: 0.019444 3rd Qu.:0.002778 3rd Qu.:0.09694 3rd Qu.: 0.008889
Max. :15.153056 Max. :0.002778 Max. :0.09694 Max. :15.153056
NA's :28
TT_Hr_SS_q95 TT_Hr_SSHG_q5 TT_Hr_SSHG_q95 TT_Hr_Mean
Min. : 0.000278 Min. : 0.000278 Min. : 0.000278 Min. :0.02913
1st Qu.: 0.013056 1st Qu.: 0.004222 1st Qu.: 0.011861 1st Qu.:0.02913
Median : 0.021597 Median : 0.006250 Median : 0.019597 Median :0.02913
Mean : 0.048950 Mean : 0.017485 Mean : 0.044792 Mean :0.02913
3rd Qu.: 0.036014 3rd Qu.: 0.009667 3rd Qu.: 0.033222 3rd Qu.:0.02913
Max. :15.153056 Max. :15.153056 Max. :15.153056 Max. :0.02913
TT_Hr_Mean_F TT_Hr_SS_Mean TT_Hr_SS_Mean_F TT_Hr_SSHG_Mean
Min. :0.01573 Min. : 0.000278 Min. : 0.0003 Min. : 0.000278
1st Qu.:0.01573 1st Qu.: 0.007832 1st Qu.: 0.0074 1st Qu.: 0.007643
Median :0.01573 Median : 0.011836 Median : 0.0112 Median : 0.011600
Mean :0.01573 Mean : 0.027673 Mean : 0.0242 Mean : 0.027654
3rd Qu.:0.01573 3rd Qu.: 0.019363 3rd Qu.: 0.0185 3rd Qu.: 0.019450
Max. :0.01573 Max. :15.153056 Max. :15.1531 Max. :15.153056
NA's :2612
TT_Hr_SSHG_Mean_F TT_Hr_Med TT_Hr_Med_F TT_Hr_SS_Med
Min. : 0.000 Min. :0.01083 Min. :0.01083 Min. : 0.000278
1st Qu.: 0.007 1st Qu.:0.01083 1st Qu.:0.01083 1st Qu.: 0.006944
Median : 0.011 Median :0.01083 Median :0.01083 Median : 0.010278
Mean : 0.025 Mean :0.01083 Mean :0.01083 Mean : 0.024132
3rd Qu.: 0.019 3rd Qu.:0.01083 3rd Qu.:0.01083 3rd Qu.: 0.017222
Max. :15.153 Max. :0.01083 Max. :0.01083 Max. :15.153056
NA's :3842
TT_Hr_SS_Med_F TT_Hr_SSHG_Med TT_Hr_SSHG_Med_F TT_Hr_Cnt
Min. : 0.0003 Min. : 0.000278 Min. : 0.000 Min. :2802888
1st Qu.: 0.0069 1st Qu.: 0.006944 1st Qu.: 0.007 1st Qu.:2802888
Median : 0.0103 Median : 0.010278 Median : 0.010 Median :2802888
Mean : 0.0224 Mean : 0.025019 Mean : 0.023 Mean :2802888
3rd Qu.: 0.0172 3rd Qu.: 0.017778 3rd Qu.: 0.018 3rd Qu.:2802888
Max. :15.1531 Max. :15.153056 Max. :15.153 Max. :2802888
NA's :2612 NA's :3842
TT_Hr_Cnt_F TT_Hr_SS_Cnt TT_Hr_SS_Cnt_F TT_Hr_SSHG_Cnt TT_Hr_SSHG_Cnt_F
Min. :2705189 Min. : 1.0 Min. : 0 Min. : 1.0 Min. : 0.00
1st Qu.:2705189 1st Qu.: 200.0 1st Qu.: 183 1st Qu.: 30.0 1st Qu.: 27.00
Median :2705189 Median : 321.0 Median : 292 Median : 52.0 Median : 47.00
Mean :2705189 Mean : 392.4 Mean : 357 Mean : 64.7 Mean : 58.19
3rd Qu.:2705189 3rd Qu.: 509.0 3rd Qu.: 464 3rd Qu.: 84.0 3rd Qu.: 76.00
Max. :2705189 Max. :1664.0 Max. :1523 Max. :691.0 Max. :634.00
SpeedAvg_Mph
Min. : 0.00
1st Qu.: 10.10
Median : 16.68
Mean : 26.54
3rd Qu.: 31.17
Max. :22924.09
NA's :28
Investigation of TravelDistance_Mi.
Let’s plot just the TravelDistance_Mi values that are NOT “NA”.
TravDistMi_HistDen <- ggplot(select(TravelDistance_Mi_NoNA,
TravelDistance_Mi
),
aes(x = TravelDistance_Mi,
y = ..density..
)
) +
geom_histogram(binwidth = 0.05, fill = "lightblue", colour = "grey60", size = 0.2) +
geom_line(stat = "density", colour = "red") +
coord_cartesian(xlim = c(0, 1.5), ylim = c(0, 4.0)
) +
labs(title = "Variation in Distance Between Stops",
x = "Travel Distance (miles)",
y = "Density"
)
TravDistMi_HistDen
Investigation of TravelDistance_Mi.
Looking at the extremely large TravelDistance_Mi values. Some (aprox 27%) of TravelDistance_Mi values > 1 mile are when the DirChange2 changes…but what about the other ~73%?
rm(TravelDistance_Mi_NoNA)
# examples of weirdly large TravelDistance_Mi
View(filter(AllDays_NewOrder,
TravelDistance_Mi > 1.1587121212 # 1.1587121212 is the 99th percentile
) %>%
arrange(desc(TravelDistance_Mi)
)
)
# Why are these extremes? Airports? Bus collection points?
View(filter(AllDays_NewOrder,
between(RowNum_OG, 494044, 494064) | # 494054
between(RowNum_OG, 494273, 494293) | # 494283
between(RowNum_OG, 494626, 494646) | # 494636
between(RowNum_OG, 1610156, 1610176) | # 1610166
between(RowNum_OG, 2073074, 2073094) # 2073084
)
)
# Before Removing Runs
# View(filter(AllDays_Sorted,
# between(RowNum_OG, 494044, 494064) | # 494054
# between(RowNum_OG, 494273, 494293) | # 494283
# between(RowNum_OG, 494626, 494646) | # 494636
# between(RowNum_OG, 1610156, 1610176) | # 1610166
# between(RowNum_OG, 2073074, 2073094) # 2073084
# )
# )
# After Removing Runs
# View(filter(AllDays_FirstStopID,
# between(RowNum_OG, 494044, 494064) | # 494054
# between(RowNum_OG, 494273, 494293) | # 494283
# between(RowNum_OG, 494626, 494646) | # 494636
# between(RowNum_OG, 1610156, 1610176) | # 1610166
# between(RowNum_OG, 2073074, 2073094) # 2073084
# )
# )
Investigation of TravelDistance_Mi.
Any relation with DirChange2? Doesn’t look as if this is so.
ExtremeTravDist <- filter(AllDays_NewOrder,
!is.na(TravelDistance_Mi)
) %>%
mutate(TravDist_Extreme = ifelse(TravelDistance_Mi > 1.1587121212, # 1.1587121212 is the 99th percentile
"True",
"False"
)
) %>%
group_by(DirChange2, TravDist_Extreme) %>%
summarise(TravDistMI_ExtCnts = n()
)
# ExtremeTravDist
ExtremeTravDist_Spread <- as.data.frame(spread(ExtremeTravDist,
TravDist_Extreme,
TravDistMI_ExtCnts
)
) %>%
select(False,
True
)
row.names(ExtremeTravDist_Spread) <- c("Change", "Same")
# str(ExtremeTravDist_Spread)
# ExtremeTravDist_Spread
prop.table(as.table(as.matrix(ExtremeTravDist_Spread)
),
1
)
False True
Change 0.80622587 0.19377413
Same 0.98855456 0.01144544
prop.table(as.table(as.matrix(ExtremeTravDist_Spread)
),
2
)
False True
Change 0.01594448 0.25169594
Same 0.98405552 0.74830406
Investigation of TravelDistance_Mi.
Looking at specific buses and StartStop_ID.
rm(ExtremeTravDist, ExtremeTravDist_Spread)
View(arrange(group_by(AllDays_NewOrder,
Bus_ID
) %>%
summarise(DistTrav_Mean = mean(TravelDistance_Mi, na.rm = TRUE),
DistTrav_Med = median(TravelDistance_Mi, na.rm = TRUE)
),
desc(DistTrav_Med)
)
)
# example of extremely small TravelDistance_Mi values (looks like the odometer wasn't functioning)
View(filter(AllDays_NewOrder,
Bus_ID == 6111 |
Bus_ID == 7201 |
Bus_ID == 8058
) %>%
arrange(Bus_ID, Event_Time)
)
View(arrange(group_by(AllDays_NewOrder,
StartStop_ID
) %>%
summarise(DistTrav_Mean = mean(TravelDistance_Mi, na.rm = TRUE),
DistTrav_Med = median(TravelDistance_Mi, na.rm = TRUE)
),
desc(DistTrav_Med)
)
)
# example of extremely large TravelDistance_Mi values...no idea why...
View(filter(AllDays_NewOrder,
StartStop_ID == "1003665--12" |
StartStop_ID == "1003665--5001925" |
StartStop_ID == "3001038--3002565"
) %>%
arrange(StartStop_ID, Event_Time)
)
Investigation of TravelDistance_Mi & TravelDistance_Mi_New.
If TravelDisntace_Mi is below the 5th percentile for that StartStop_ID, or if TravelDisntace_Mi is above the 95th percentile for that StartStop_ID, or if TravelDistance_Mi is NA (when the BusDay_EventNum !=1), consider this an outlier. In this case, replace the value with the mean for that StartStop_ID and HourGroup (TD_Mi_SSHG_Mean_F), or if there are not enough values at the HourGroup level, replace it with the mean for that StartStop_ID.
# View(tail(AllDays_NewOrder, 500))
AllDays_NewTravelDist <-
mutate(AllDays_NewOrder,
TravelDistance_Mi_New = ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SSHG_Cnt_F >= 20,
TD_Mi_SSHG_Mean_F,
ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SSHG_Cnt_F < 20 &
TD_Mi_SS_Cnt_F >= 20,
TD_Mi_SS_Mean_F,
ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SS_Cnt_F < 20 &
TD_Mi_SS_Cnt >= 20,
TD_Mi_SS_Mean,
ifelse(is.na(TravelDistance_Mi) &
BusDay_EventNum != 1 &
TravelDistance_Mi_Hvrs != 0,
TravelDistance_Mi_Hvrs,
ifelse(is.na(TravelDistance_Mi) &
BusDay_EventNum != 1 &
TravelDistance_Mi_Hvrs == 0,
TD_Mi_SS_Mean,
TravelDistance_Mi
))))),
TravelDistance_Mi_New_Label =
factor(ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SSHG_Cnt_F >= 20,
"TD_Mi_SSHG_Mean_F",
ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SSHG_Cnt_F < 20 &
TD_Mi_SS_Cnt_F >= 20,
"TD_Mi_SS_Mean_F",
ifelse(!is.na(TravelDistance_Mi) &
(TravelDistance_Mi < TD_Mi_SSHG_q5 |
TravelDistance_Mi > TD_Mi_SSHG_q95
) &
TD_Mi_SS_Cnt_F < 20 &
TD_Mi_SS_Cnt >= 20,
"TD_Mi_SS_Mean",
ifelse(is.na(TravelDistance_Mi) &
BusDay_EventNum != 1 &
TravelDistance_Mi_Hvrs != 0,
"TravelDistance_Mi_Hvrs",
ifelse(is.na(TravelDistance_Mi) &
BusDay_EventNum != 1 &
TravelDistance_Mi_Hvrs == 0,
"TD_Mi_SS_Mean",
"TravelDistance_Mi"
)))))
),
TravelDistance_Mi_NewHvrs = ifelse(!is.na(TravelDistance_Mi_Hvrs) &
TravelDistance_Mi_Hvrs != 0 &
(TravelDistance_Mi_New < TD_Mi_q2 |
TravelDistance_Mi_New > TD_Mi_q98
),
TravelDistance_Mi_Hvrs,
TravelDistance_Mi_New
),
TravelDistance_Mi_NewHvrs_Label =
factor(ifelse(!is.na(TravelDistance_Mi_Hvrs) &
TravelDistance_Mi_Hvrs != 0 &
(TravelDistance_Mi_New < TD_Mi_q2 |
TravelDistance_Mi_New > TD_Mi_q98
),
"TravelDistance_Mi_Hvrs",
as.character(TravelDistance_Mi_New_Label)
)
),
SpeedAvg_Mph_NewHvrs = TravelDistance_Mi_NewHvrs / TravelTime_Hr
)
str(AllDays_NewTravelDist)
'data.frame': 2809529 obs. of 119 variables:
$ RowNum_OG : int 1 3 4 5 6 7 9 10 11 12 ...
$ group : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ StartStop_ID : chr "NULL--5004572" "5004572--5004573" "5004573--5002210" "5002210--5002209" ...
$ BusDay_EventNum : int 1 2 3 4 5 6 7 8 9 10 ...
$ Bus_ID : int 11 11 11 11 11 11 11 11 11 11 ...
$ Route : chr "S80" "S80" "S80" "S80" ...
$ RteChange2 : Factor w/ 2 levels "Change","Same": 1 2 2 2 2 2 2 2 2 2 ...
$ RouteAlt : Factor w/ 14 levels "1","10","11",..: 1 1 1 1 1 1 6 6 6 6 ...
$ DirChange2 : Factor w/ 2 levels "Change","Same": 1 2 2 2 2 2 1 2 2 2 ...
$ Route_Direction : Factor w/ 12 levels "","ANTICLKW",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Stop_Sequence : int 7 6 3 2 8 1 2 3 4 2 ...
$ Start_ID : chr NA "5004572" "5004573" "5002210" ...
$ Start_Desc : chr NA "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" ...
$ StopID_Clean : chr "5004572" "5004573" "5002210" "5002209" ...
$ StopID_Indicator : Factor w/ 2 levels "ID_Bad","ID_OK": 2 2 2 2 2 2 2 2 2 2 ...
$ Stop_Desc : chr "BEULAH ST + CHARLES ARRINGTON DR" "WALKER LN + #6363" "WALKER LN + BEULAH ST" "BEULAH ST + CHARLES ARRINGTON DR" ...
$ Event_Type : int 4 4 4 4 3 3 4 4 4 4 ...
$ Event_Description : Factor w/ 3 levels "Serviced Stop ",..: 3 3 3 3 1 1 3 3 3 3 ...
$ Event_Time_Yr : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Event_Time_Mth : int 10 10 10 10 10 10 10 10 10 10 ...
$ Event_Time_Date : int 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ Event_Time_Hr : int 6 6 6 6 6 6 6 6 6 6 ...
$ Event_Time_HrGroup : Ord.factor w/ 8 levels "Group0_2"<"Group3_5"<..: 3 3 3 3 3 3 3 3 3 3 ...
$ Event_Time_Min : int 6 9 10 10 13 14 21 21 23 23 ...
$ Event_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Departure_Time : POSIXct, format: "2016-10-03 06:06:47" "2016-10-03 06:09:47" ...
$ Dwell_Time : int 0 0 0 0 0 104 0 0 0 0 ...
$ Dwell_Time2 : num 0 0 0 0 0 104 0 0 0 0 ...
$ Delta_Time : int -177 24 165 25 73 719 74 76 63 69 ...
$ Latitude : num 38.8 38.8 38.8 38.8 38.8 ...
$ Longitude : num -77.2 -77.2 -77.2 -77.2 -77.2 ...
$ Heading : int 199 97 276 15 119 100 274 104 241 274 ...
$ Odometer_Distance : int 43543 45139 46418 50115 51074 51303 55633 56163 56285 57262 ...
$ Odometer_Distance_Lag1 : int NA 43543 45139 46418 50115 51074 51303 55633 56163 56285 ...
$ Odometer_Distance_Mi : num 8.25 8.55 8.79 9.49 9.67 ...
$ TravelDistance_Ft : int NA 1596 1279 3697 959 229 4330 530 122 977 ...
$ TravelDistance_Mi : num NA 0.302 0.242 0.7 0.182 ...
$ TravelDistance_Mi_Hvrs : num NA 0.15 0.105 0.165 0.832 ...
$ TD_Mi_q2 : num 0.0521 0.0521 0.0521 0.0521 0.0521 ...
$ TD_Mi_q98 : num 0.959 0.959 0.959 0.959 0.959 ...
$ TD_Mi_SS_q5 : num NA 0.0252 0.2422 0.7324 0.0794 ...
$ TD_Mi_SS_q95 : num NA 0.626 0.242 1.008 0.176 ...
$ TD_Mi_SSHG_q5 : num NA 0.0996 0.2422 0.7002 0.1816 ...
$ TD_Mi_SSHG_q95 : num NA 0.627 0.242 0.7 0.182 ...
$ TD_Mi_Mean : num 0.308 0.308 0.308 0.308 0.308 ...
$ TD_Mi_Mean_F : num 0.232 0.232 0.232 0.232 0.232 ...
$ TD_Mi_SS_Mean : num NaN 0.437 0.242 0.908 0.128 ...
$ TD_Mi_SS_Mean_F : num NaN 0.457 0.242 0.977 NaN ...
$ TD_Mi_SSHG_Mean : num NaN 0.442 0.242 0.7 0.182 ...
$ TD_Mi_SSHG_Mean_F : num NaN 0.491 0.242 0.7 0.182 ...
$ TD_Mi_Med : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_Med_F : num 0.198 0.198 0.198 0.198 0.198 ...
$ TD_Mi_SS_Med : num NA 0.512 0.242 0.962 0.128 ...
$ TD_Mi_SS_Med_F : num NA 0.512 0.242 1.008 NA ...
$ TD_Mi_SSHG_Med : num NA 0.512 0.242 0.7 0.182 ...
$ TD_Mi_SSHG_Med_F : num NA 0.512 0.242 0.7 0.182 ...
$ TD_Mi_Cnt : int 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 2486795 ...
$ TD_Mi_Cnt_F : int 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 2387406 ...
$ TD_Mi_SS_Cnt : int 0 14 1 4 2 87 22 118 91 11 ...
$ TD_Mi_SS_Cnt_F : int 0 12 1 3 0 77 18 106 81 9 ...
$ TD_Mi_SSHG_Cnt : int 0 7 1 1 1 23 6 29 28 3 ...
$ TD_Mi_SSHG_Cnt_F : int 0 5 1 1 1 19 4 25 24 1 ...
$ TravelTime_Sec : num NA 180 37 25 190 29 288 52 76 8 ...
$ TT_Sec_q2 : num 10 10 10 10 10 10 10 10 10 10 ...
$ TT_Sec_q98 : num 349 349 349 349 349 349 349 349 349 349 ...
$ TT_Sec_SS_q5 : num NA 11.9 37 30.5 172.9 ...
$ TT_Sec_SS_q95 : num NA 346.3 37 75.8 189.1 ...
$ TT_Sec_SSHG_q5 : num NA 59.6 37 25 190 11.6 236 51.5 55 8.8 ...
$ TT_Sec_SSHG_q95 : num NA 276 37 25 190 ...
$ TT_Sec_Mean : num 105 105 105 105 105 ...
$ TT_Sec_Mean_F : num 56.6 56.6 56.6 56.6 56.6 ...
$ TT_Sec_SS_Mean : num NaN 215.8 37 58.2 181 ...
$ TT_Sec_SS_Mean_F : num NaN 218.9 37 65.5 NaN ...
$ TT_Sec_SSHG_Mean : num NaN 202 37 25 190 ...
$ TT_Sec_SSHG_Mean_F : num NaN 226 37 25 190 ...
$ TT_Sec_Med : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_Med_F : num 39 39 39 39 39 39 39 39 39 39 ...
$ TT_Sec_SS_Med : num NA 223.5 37 65.5 181 ...
$ TT_Sec_SS_Med_F : num NA 223.5 37 65.5 NA ...
$ TT_Sec_SSHG_Med : num NA 219 37 25 190 134 286 60 65 16 ...
$ TT_Sec_SSHG_Med_F : num NA 219 37 25 190 134 286 60 65 16 ...
$ TT_Sec_Cnt : int 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 2802888 ...
$ TT_Sec_Cnt_F : int 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 2705189 ...
$ TT_Sec_SS_Cnt : int 0 14 1 4 2 173 22 141 141 11 ...
$ TT_Sec_SS_Cnt_F : int 0 12 1 2 0 156 18 127 128 9 ...
$ TT_Sec_SSHG_Cnt : int 0 7 1 1 1 35 6 36 35 3 ...
$ TT_Sec_SSHG_Cnt_F : int 0 5 1 1 1 31 4 32 32 1 ...
$ TravelTime_Hr : num NA 0.05 0.01028 0.00694 0.05278 ...
$ TT_Hr_q2 : num 0.00278 0.00278 0.00278 0.00278 0.00278 ...
$ TT_Hr_q98 : num 0.0969 0.0969 0.0969 0.0969 0.0969 ...
$ TT_Hr_SS_q5 : num NA 0.00331 0.01028 0.00849 0.04803 ...
$ TT_Hr_SS_q95 : num NA 0.0962 0.0103 0.0211 0.0525 ...
$ TT_Hr_SSHG_q5 : num NA 0.01656 0.01028 0.00694 0.05278 ...
$ TT_Hr_SSHG_q95 : num NA 0.07653 0.01028 0.00694 0.05278 ...
$ TT_Hr_Mean : num 0.0291 0.0291 0.0291 0.0291 0.0291 ...
$ TT_Hr_Mean_F : num 0.0157 0.0157 0.0157 0.0157 0.0157 ...
$ TT_Hr_SS_Mean : num NaN 0.0599 0.0103 0.0162 0.0503 ...
$ TT_Hr_SS_Mean_F : num NaN 0.0608 0.0103 0.0182 NaN ...
[list output truncated]
Investigation of TravelDistance_Mi & TravelDistance_Mi_Hvrs & TravelDistance_Mi_New.
Quick summary and then correlation calculation.
rm(AllDays_NewOrder)
# 38 rows meet this criteria anymore -- appears to be the case when both the Lat Long calculations, and the TravelDistance calculations did not function properly.
View(filter(AllDays_NewTravelDist,
is.na(TravelDistance_Mi_New) &
BusDay_EventNum != 1
)
)
View(AllDays_NewTravelDist %>%
arrange(desc(TravelDistance_Mi_New)) %>%
head(500)
)
summary(select(AllDays_NewTravelDist,
TravelDistance_Mi,
TravelDistance_Mi_Hvrs,
TravelDistance_Mi_New,
TravelDistance_Mi_NewHvrs
)
)
TravelDistance_Mi TravelDistance_Mi_Hvrs TravelDistance_Mi_New
Min. : 0.0 Min. : 0.000 Min. : 0.000
1st Qu.: 0.1 1st Qu.: 0.106 1st Qu.: 0.141
Median : 0.2 Median : 0.142 Median : 0.199
Mean : 0.3 Mean : 0.201 Mean : 0.298
3rd Qu.: 0.3 3rd Qu.: 0.193 3rd Qu.: 0.276
Max. :250.7 Max. :24.407 Max. :250.656
NA's :322734 NA's :6528 NA's :6566
TravelDistance_Mi_NewHvrs
Min. : 0.000
1st Qu.: 0.142
Median : 0.199
Mean : 0.259
3rd Qu.: 0.276
Max. :36.236
NA's :6566
summary(select(filter(AllDays_NewTravelDist,
BusDay_EventNum != 1
),
TravelDistance_Mi,
TravelDistance_Mi_Hvrs,
TravelDistance_Mi_New,
TravelDistance_Mi_NewHvrs
)
)
TravelDistance_Mi TravelDistance_Mi_Hvrs TravelDistance_Mi_New
Min. : 0.00 Min. : 0.0000 Min. : 0.00019
1st Qu.: 0.13 1st Qu.: 0.1055 1st Qu.: 0.14072
Median : 0.20 Median : 0.1424 Median : 0.19867
Mean : 0.31 Mean : 0.2008 Mean : 0.29751
3rd Qu.: 0.29 3rd Qu.: 0.1935 3rd Qu.: 0.27633
Max. :250.66 Max. :24.4068 Max. :250.65606
NA's :316206 NA's :38
TravelDistance_Mi_NewHvrs
Min. : 0.00019
1st Qu.: 0.14205
Median : 0.19903
Mean : 0.25859
3rd Qu.: 0.27557
Max. :36.23636
NA's :38
cor(select(AllDays_NewTravelDist,
TravelDistance_Mi,
TravelDistance_Mi_Hvrs,
TravelDistance_Mi_New,
TravelDistance_Mi_NewHvrs
),
use = "pairwise.complete.obs"
)
TravelDistance_Mi TravelDistance_Mi_Hvrs
TravelDistance_Mi 1.0000000 0.5447660
TravelDistance_Mi_Hvrs 0.5447660 1.0000000
TravelDistance_Mi_New 0.9513379 0.5837182
TravelDistance_Mi_NewHvrs 0.6005944 0.9005277
TravelDistance_Mi_New TravelDistance_Mi_NewHvrs
TravelDistance_Mi 0.9513379 0.6005944
TravelDistance_Mi_Hvrs 0.5837182 0.9005277
TravelDistance_Mi_New 1.0000000 0.6346981
TravelDistance_Mi_NewHvrs 0.6346981 1.0000000
Investigation of TravelDistance_Mi_NewHvrs_Label & TravelDistance_Mi_NewHvrs_Label.
Show how the labels changed.
group_by(AllDays_NewTravelDist,
TravelDistance_Mi_New_Label,
TravelDistance_Mi_NewHvrs_Label
) %>%
summarise(CntNum = n(),
CntPct = format(CntNum / nrow(AllDays_NewTravelDist),
scientific = 9999
)
) %>%
arrange(desc(CntPct)
)
Investigation of TravelDistance_Mi & TravelDistance_Mi_Hvrs & TravelDistance_Mi_New.
Graphing the two methods of calculating TravelDistance_Mi.
First, let’s get create a function to plot the liner model equation.
lm_eqn <- function(df, y, x){
m <- lm(y ~ x, df)
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2),
s1 = ifelse(test = coef(m)[2] > 0,
yes = "+",
no = "-"
),
r2 = format(summary(m)$r.squared,
digits = 3
)
)
eq <- substitute(italic(y) == a~~s1~~b %.% italic(x)*","~~italic(r)^2~"="~r2,
l
)
as.character(as.expression(eq)
)
}
Investigation of TravelDistance_Mi & TravelDistance_Mi_NewHvrs.
Scatter plot (using a 10% sample to making plotting time faster and to reduce un-needed data in the “same” splot).
set.seed(123456789)
AllDays_NewTravelDist_10Pct <- filter(AllDays_NewTravelDist,
!is.na(TravelDistance_Mi_NewHvrs) &
!is.na(TravelDistance_Mi)
) %>%
rename(DistMethod = TravelDistance_Mi_NewHvrs_Label) %>%
sample_frac(0.1)
TravDist_MiVsCalc <- ggplot(select(AllDays_NewTravelDist_10Pct,
TravelDistance_Mi_NewHvrs,
TravelDistance_Mi,
DistMethod
),
aes(x = TravelDistance_Mi,
y = TravelDistance_Mi_NewHvrs,
colour = DistMethod
)
) +
scale_colour_manual(values = c("red","blue", "green", "orange", "black")
) +
geom_point(shape = 1, alpha = 0.5) +
scale_shape(solid = FALSE) +
geom_smooth(method = "lm", colour = "blue") +
geom_abline(intercept = 0, slope = 1, colour = "red") +
coord_cartesian(xlim = c(0, 1.5), ylim = c(0, 1.5)
) +
scale_x_continuous(breaks = seq(0, 1.5, 0.25)
) +
scale_y_continuous(breaks = seq(0, 1.5, 0.25)
) +
theme(legend.position = "bottom", #c(0.85, 0.40),
legend.text = element_text(size = 6)
) +
annotate(label = lm_eqn(df = AllDays_NewTravelDist_10Pct,
x = AllDays_NewTravelDist_10Pct$TravelDistance_Mi,
y = AllDays_NewTravelDist_10Pct$TravelDistance_Mi_NewHvrs
),
# x = 62,
# y = 20,
x = 0.70,
y = 0.00,
geom = "text",
size = 3,
colour = "blue",
parse = TRUE
) +
annotate(label = "Reference Line (slope = 1)",
# x = 16,
# y = 30,
x = 0.80,
y = 1.05,
geom = "text",
size = 3,
colour = "red"
) +
labs(title = "TravelDistance_Mi vs. TravelDistance_Mi_NewHvrs",
x = "TravelDistance_Mi",
y = "TravelDistance_Mi_NewHvrs"
)
# +
# geom_jitter()
TravDist_MiVsCalc
Investigation of TravelDistance_Mi & TravelDistance_Mi_Hvrs & TravelDistance_Mi_New.
Graphing test with rbokeh.
TravDist_MiVsCalc_Bokeh <- figure(data = select(AllDays_NewTravelDist_10Pct,
TravelDistance_Mi_NewHvrs,
TravelDistance_Mi,
DistMethod
),
xlim = c(0, 1.5),
ylim = c(0, 1.5),
legend_location = "bottom_right"
) %>%
ly_points(x = TravelDistance_Mi,
y = TravelDistance_Mi_NewHvrs,
color = DistMethod,
hover = c(TravelDistance_Mi_NewHvrs, TravelDistance_Mi, DistMethod)
) %>%
ly_abline(a = 0, b = 1, color = "red")
TravDist_MiVsCalc_Bokeh